Linux RAID subsystem development


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-21 11:30 UTC
  To: Wols Lists
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <58AB320B.1060707@youngman.org.uk>

On 2017/2/21 2:14 AM, Wols Lists wrote:
> On 20/02/17 08:07, Coly Li wrote:
>> For the function pointer assignment, it is because I see a branch
>> inside a loop. If I use a function pointer, I can avoid a redundant
>> branch inside the loop. raid1_read_request() and
>> raid1_write_request() are not simple functions, and I don't know
>> whether gcc will inline them or not, so I am checking the
>> disassembled code.
> 
> Can you force gcc to inline or compile a function? Isn't it dangerous to
> rely on default behaviour and assume it won't change when the compiler
> is upgraded?

I choose to trust the compiler, and the people behind gcc.

Coly



* Re: Trouble reassembling RAID10
From: Phil Turmel @ 2017-02-21 15:16 UTC
  To: Roger Roglans, linux-raid
In-Reply-To: <CAPXQET=6v3NTeujqtq69KfzAv1gpeG8vGQPfwMxxOaECvFR4sg@mail.gmail.com>

Hi Roger,

On 02/20/2017 04:42 PM, Roger Roglans wrote:
> Hey new to the mailing list and fairly new to RAIDs in general. I
> ran into an issue and was hoping someone could help.

We probably can.

> Our server that runs a 14 drive RAID10 through a rocketraid 2470 
> controller refused to assemble. Our goal is not necessarily to
> recover a working RAID, but to get as much data back as possible.

Amounts to the same thing.

> Maybe as a consequence of the assembly failure, upon shutting down
> the server, it would get stuck in boot loops. So I'm currently
> running Ubuntu 16.04.1 from a USB. I've determined that 2 of 14
> disks are faulty and have determined which ones they are.

Three.  Two have been faulty for a very long time.  No-one noticed
the degraded status.

> Here is the output of a mdadm --examine call.

Please re-do this, combined with smartctl, and without grep.  This
will tell us everything about your array.  Like so:

for x in /dev/sd[a-p] ; do mdadm -E ${x}1 ; smartctl -iA -l scterc $x ; done

Paste the output *inline* in your plain-text reply with line
wrapping disabled.  If your draft email is larger than 100k, split
into multiple emails.

You are likely to need an alternate bootable USB stick -- your
report sounds like one of the versions of mdadm that had a bug
in forced assembly.  I usually use the latest one from
https://www.system-rescue-cd.org/

Please also read the recent thread and its references starting
here: https://marc.info/?l=linux-raid&m=148755536616025&w=2

Phil


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-21 17:45 UTC
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <f36e85b1-cca1-5be4-a171-d64057cc56a3@suse.de>

On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> On 2017/2/21 8:29 AM, NeilBrown wrote:
> > On Mon, Feb 20 2017, Coly Li wrote:
> > 
> >>> On 2017-02-20, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
> >>> 
> >>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>> 
> >>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>> 
> >>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote: I know you are
> >>>>>>> going to change this as Shaohua wants the splitting to
> >>>>>>> happen in a separate function, which I agree with, but
> >>>>>>> there is something else wrong here. Calling
> >>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>> It is OK for simple devices, but when one request can
> >>>>>>> wait for another request to the same device it can 
> >>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>> request calls raise_barrier() between one request and
> >>>>>>> the next, then the next has to wait for the resync
> >>>>>>> request, which has to wait for the first request. As
> >>>>>>> the first request will be stuck in the queue in 
> >>>>>>> generic_make_request(), you get a deadlock.
> >>>>>> 
> >>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>> queue in underlying device, can I understand it as the
> >>>>>> data structures like plug->pending and 
> >>>>>> conf->pending_bio_list ?
> >>>>> 
> >>>>> Yes, the queue in generic_make_request() is the
> >>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>> about.  I'm not referring to plug->pending or
> >>>>> conf->pending_bio_list at all.
> >>>>> 
> >>>>>> 
> >>>>>> I still don't get the point of deadlock, let me try to
> >>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>> is split, and the first part is processed by
> >>>>>> make_request_fn(), and then a resync comes and it will 
> >>>>>> raise a barrier, there are 3 possible conditions, - the
> >>>>>> resync I/O tries to raise barrier on same bucket of the
> >>>>>> first regular bio. Then the resync task has to wait to
> >>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>> 
> >>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>> happens immediately. Then the resync_task will increment
> >>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>> will have submitted bios to the underlying device, and they
> >>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>> processed until raid1_make_request() completes.
> >>>>> 
> >>>>> The loop in raid1_make_request() will then call
> >>>>> make_request_fn() which will call wait_barrier(), which
> >>>>> will wait for ->barrier[idx] to be zero.
> >>>> 
> >>>> Thinking more carefully about this.. the 'idx' that the
> >>>> second bio will wait for will normally be different, so there
> >>>> won't be a deadlock after all.
> >>>> 
> >>>> However it is possible for hash_long() to produce the same
> >>>> idx for two consecutive barrier_units so there is still the
> >>>> possibility of a deadlock, though it isn't as likely as I
> >>>> thought at first.
> >>> 
> >>> Wrapped the function pointer issue Neil pointed out into Coly's
> >>> original patch. Also fix a 'use-after-free' bug. For the
> >>> deadlock issue, I'll add below patch, please check.
> >>> 
> >>> Thanks, Shaohua
> >>> 
> >> 
> 
> Neil,
> 
> Thanks for your patient explanation. I think I am starting to follow
> what you mean. Let me restate what I understand; correct me if I am
> wrong.
> 
> 
> >> Hmm, please hold, I am still thinking about it. With barrier
> >> buckets and hash_long(), I don't see a deadlock yet. For raid10 it
> >> might happen, but once we have barrier buckets on it, there will
> >> be no deadlock.
> >> 
> >> My question is, this deadlock only happens when a big bio is
> >> split, the split small bios are contiguous, and the resync I/O
> >> visits barrier buckets in sequential order too. If adjacent split
> >> regular bios or resync bios hit the same barrier bucket, it would
> >> be a very big failure of the hash design, and should have been
> >> found already. But no one has complained about it, so I am not
> >> convinced the deadlock is real with I/O barrier buckets (this is
> >> what Neil is concerned about).
> > 
> > I think you are wrong about the design goal of a hash function. 
> > When fed a sequence of inputs, with any stride (i.e. with any
> > constant difference between consecutive inputs), the output of the
> > hash function should appear to be random. A random sequence can
> > produce the same number twice in a row. If the hash function
> > produces a number from 0 to N-1, you would expect two consecutive
> > outputs to be the same about once every N inputs.
> > 
> 
> Yes, you are right. But when I mentioned hash conflicts, I limited
> the integers to the range [0, 1<<38]. 38 is (64-17-9): when a 64-bit
> LBA address is divided by the 64MB I/O barrier unit size, its value
> range is reduced to [0, 1<<38].
> 
> The maximum size of a normal bio is 1MB, so it can be split into at
> most 2 bios.
> 
> For a DISCARD bio, the maximum size is 4GB, so it can be split into
> at most 65 bios.
> 
> So in this patch, the hash question reduces to: for any 65
> consecutive integers in the range [0, 1<<38], when hash_long() hashes
> them into the range [0, 1023], will any hash conflict happen among
> them?
> 
> I checked half of that range, [0, 1<<37], for hash conflicts by
> writing a simple program that emulates the hash calculation in the
> new I/O barrier patch and iterates over all runs of {2, 65, 128, 512}
> consecutive integers in [0, 1<<37].
> 
> On a 20-core CPU each run took 7+ hours. In the end I found no hash
> conflict for up to 512 consecutive integers under the above
> conditions. For 1024, a lot of hash conflicts were detected.
> 
> The [0, 1<<37] range maps back to a 63-bit LBA range, which is large
> enough for almost all existing md raid configurations. So for the
> current kernel implementation and real-world devices, no hash
> conflict is possible within a single bio with the new I/O barrier
> patch.
> 
> If bi_iter.bi_size changes from unsigned int to unsigned long in the
> future, the above assumption will be wrong. There will be hash
> conflicts and a potential deadlock, which is quite implicit. Yes, I
> agree with you. No, splitting the bio inside a loop is not perfect.
> 
> > Even if there was no possibility of a deadlock from a resync
> > request happening between two bios, there are other possibilities.
> > 
> 
> The text below teaches me more about the raid1 code, but it also
> confuses me more. Here are my questions:
> 
> > It is not, in general, safe to call mempool_alloc() twice in a
> > row, without first ensuring that the first allocation will get
> > freed by some other thread.  raid1_write_request() allocates from
> > r1bio_pool, and then submits bios to the underlying device, which
> > get queued on bio_list_on_stack.  They will not be processed until
> > after raid1_make_request() completes, so when raid1_make_request
> > loops around and calls raid1_write_request() again, it will try to
> > allocate another r1bio from r1bio_pool, and this might end up
> > waiting for the r1bio which is trapped and cannot complete.
> > 
> 
> Can I say that it is because blk_finish_plug() won't be called before
> raid1_make_request() returns ? Then in raid1_write_request(), mbio
> will be added into plug->pending, but before blk_finish_plug() is
> called, they won't be handled.

blk_finish_plug is called if raid1_make_request sleeps. The bio is held in
current->bio_list, not in the plug list.
 
> > As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> > impossible.  If 256 threads all attempt a write (or read) that
> > crosses a boundary, then they will consume all 256 preallocated
> > entries, and want more. If there is no free memory, they will block
> > indefinitely.
> > 
> 
> If raid1_make_request() is modified in this way,
> +	if (bio_data_dir(split) == READ)
> +		raid1_read_request(mddev, split);
> +	else
> +		raid1_write_request(mddev, split);
> +	if (split != bio)
> +		generic_make_request(bio);
> 
> Then the original bio will be added to the bio_list_on_stack of the
> top-level generic_make_request(). current->bio_list is initialized,
> so when generic_make_request() is called nested in
> raid1_make_request(), the split bio will be added to
> current->bio_list and nothing else happens.
> 
> After the nested generic_make_request() returns, execution continues
> with the next lines of the top-level generic_make_request(),
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
>
>                         bio = bio_list_pop(current->bio_list);
> 
> bio_list_pop() will return the second half of the split bio, and it is

So in the above sequence, current->bio_list will hold bios in this order:
bios to the underlying disks, then the second half of the original bio.

bio_list_pop will pop the bios to the underlying disks first and handle them,
then the second half of the original bio.

That said, this doesn't work for arrays stacked 3 layers deep, because in a
3-layer array, handling the middle-layer bio will cause the 3rd-layer bio to
be held on bio_list again.
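
The ordering described above can be illustrated with a small userspace toy.
This is only a sketch under assumptions: the FIFO below stands in for
current->bio_list, the labels are made up for illustration, and none of the
toy_* names exist in the kernel. It models the two-layer case only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_BIOS 16

/* Toy stand-in for current->bio_list: a FIFO of labels, not real bios. */
struct toy_bio_list {
	const char *items[TOY_MAX_BIOS];
	int head;
	int tail;
};

/* Append at the tail, like queueing a bio while a make_request is active. */
static void toy_add(struct toy_bio_list *l, const char *label)
{
	assert(l->tail < TOY_MAX_BIOS);
	l->items[l->tail++] = label;
}

/* Remove from the head, like bio_list_pop(); NULL when the list is empty. */
static const char *toy_pop(struct toy_bio_list *l)
{
	return l->head < l->tail ? l->items[l->head++] : NULL;
}

/*
 * Model one pass of a two-mirror raid1 over a split bio and record the
 * order in which the outer loop would pop the queued entries.
 * Returns the number of labels written to order[].
 */
int toy_two_layer_order(const char *order[], int max)
{
	struct toy_bio_list list = { .head = 0, .tail = 0 };
	const char *b;
	int n = 0;

	/* raid1 handles the first half: mirror bios are queued, not issued */
	toy_add(&list, "disk0:first-half");
	toy_add(&list, "disk1:first-half");
	/* the nested submission of the remainder lands behind them */
	toy_add(&list, "raid1:second-half");

	/* the outer loop drains the list from the head */
	while (n < max && (b = toy_pop(&list)) != NULL)
		order[n++] = b;
	return n;
}
```

Calling toy_two_layer_order() with an array of 8 entries fills it with the
two disk bios followed by the second half. The toy does not model a third
layer, where a popped middle-layer bio would append new entries behind the
second half.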

Thanks,
Shaohua
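
For readers who want to repeat the hash conflict check quoted earlier in
this message, here is a minimal userspace sketch. It is not the program
that was actually run. It assumes 1024 buckets (the [0, 1023] range above)
and assumes hash_long() on a 64-bit kernel is a multiply by GOLDEN_RATIO_64
followed by a shift; the multiplier value and all sketch_* names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values, see the note above this block. */
#define SKETCH_BUCKET_BITS	10
#define SKETCH_BUCKETS		(1u << SKETCH_BUCKET_BITS)
#define SKETCH_GOLDEN_RATIO_64	0x61C8864680B583EBull

/* Userspace stand-in for hash_long(barrier_unit, 10). */
unsigned int sketch_bucket(uint64_t barrier_unit)
{
	return (unsigned int)((barrier_unit * SKETCH_GOLDEN_RATIO_64) >>
			      (64 - SKETCH_BUCKET_BITS));
}

/*
 * Return 1 if any two of the `count` consecutive barrier units starting
 * at `first` land in the same bucket, 0 otherwise.
 */
int sketch_window_conflicts(uint64_t first, unsigned int count)
{
	unsigned char seen[SKETCH_BUCKETS];
	unsigned int i;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < count; i++) {
		unsigned int b = sketch_bucket(first + i);

		if (seen[b])
			return 1;	/* two units share a bucket */
		seen[b] = 1;
	}
	return 0;
}
```

Sweeping `first` over a range with count set to 65 is a small-scale version
of the check described above. Any window longer than 1024 units must report
a conflict, since there are only 1024 buckets.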


* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: Shaohua Li @ 2017-02-21 17:58 UTC
  To: George Rapp; +Cc: Linux-RAID, Matthew Krumwiede, neilb, Jes.Sorensen
In-Reply-To: <CAF-KpgY0ySvCN9ftbDmW_P6wDiyfN2yWE6=NECVru4=vCe+pbQ@mail.gmail.com>

On Mon, Feb 20, 2017 at 05:18:46PM -0500, George Rapp wrote:
> On Sat, Feb 11, 2017 at 7:32 PM, George Rapp <george.rapp@gmail.com> wrote:
> > Previous thread: http://marc.info/?l=linux-raid&m=148564798430138&w=2
> > -- to summarize, while adding two drives to a RAID 5 array, one of the
> > existing RAID 5 component drives failed, causing the reshape progress
> > to stall at 77.5%. I removed the previous thread from this message to
> > conserve space -- before resolving that situation, another problem has
> > arisen.
> >
> > We have cloned and replaced the failed /dev/sdg with "ddrescue --force
> > -r3 -n /dev/sdh /dev/sde c/sdh-sde-recovery.log"; copied in below, or
> > viewable via https://app.box.com/v/sdh-sde-recovery . The failing
> > device was removed from the server, and the RAID component partition
> > on the cloned drive is now /dev/sdg4.
> 
> [previous thread snipped - after stepping through the code under gdb,
> I realized that "mdadm --assemble --force" was needed.]
> 
> # uname -a
> Linux localhost 4.3.4-200.fc22.x86_64 #1 SMP Mon Jan 25 13:37:15 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> # mdadm --version
> mdadm - v3.3.4 - 3rd August 2015
> 
> As previously mentioned, the device that originally failed was cloned
> to a new drive. This copy included the bad blocks list from the md
> metadata, because I'm showing 23 bad blocks on the clone target drive,
> /dev/sdg4:
> 
> # mdadm --examine-badblocks /dev/sdg4
> Bad-blocks on /dev/sdg4:
>           3802454640 for 512 sectors
>           3802455664 for 512 sectors
>           3802456176 for 512 sectors
>           3802456688 for 512 sectors
>           3802457200 for 512 sectors
>           3802457712 for 512 sectors
>           3802458224 for 512 sectors
>           3802458736 for 512 sectors
>           3802459248 for 512 sectors
>           3802459760 for 512 sectors
>           3802460272 for 512 sectors
>           3802460784 for 512 sectors
>           3802461296 for 512 sectors
>           3802461808 for 512 sectors
>           3802462320 for 512 sectors
>           3802462832 for 512 sectors
>           3802463344 for 512 sectors
>           3802463856 for 512 sectors
>           3802464368 for 512 sectors
>           3802464880 for 512 sectors
>           3802465392 for 512 sectors
>           3802465904 for 512 sectors
>           3802466416 for 512 sectors
> 
> However, when I run the following command to attempt to read each of
> the bad blocks, no I/O errors pop up either on the command line or in
> /var/log/messages:
> 
> # for i in $(mdadm --examine-badblocks /dev/sdg4 | grep "512 sectors"
> | cut -c11-20) ; do dd bs=512 if=/dev/sdg4 skip=$i count=512 | wc -c;
> done
> 
> I've truncated the output, but in each case it is similar to this:
> 
> 512+0 records in
> 512+0 records out
> 262144
> 262144 bytes (262 kB) copied, 0.636762 s, 412 kB/s
> 
> Thus, the bad blocks on the failed hard drive are apparently now
> readable on the cloned drive.
> 
> When I try to assemble the RAID 5 array, though, the process gets
> stuck at the location of the first bad block. The assemble command is:
> 
> # mdadm --assemble --force /dev/md4
> --backup-file=/home/gwr/2017/2017-01/md4_backup__2017-01-25 /dev/sde4
> /dev/sdf4 /dev/sdh4 /dev/sdl4 /dev/sdg4 /dev/sdk4 /dev/sdi4 /dev/sdj4
> /dev/sdb4 /dev/sdd4
> mdadm: accepting backup with timestamp 1485366772 for array with
> timestamp 1487624068
> mdadm: /dev/md4 has been started with 9 drives (out of 10).
> 
> The md4_raid5 process immediately spikes to 100% CPU utilization, and
> the reshape stops at 1901225472 KiB (which is exactly half of the
> first bad sector value, 3802454640):
> 
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md4 : active raid5 sde4[0] sdb4[12] sdj4[7] sdi4[8] sdk4[11] sdg4[10]
> sdl4[9] sdh4[2] sdf4[1]
>       13454923776 blocks super 1.1 level 5, 512k chunk, algorithm 2
> [10/9] [UUUUUUUUU_]
>       [===================>.]  reshape = 98.9% (1901225472/1922131968)
> finish=2780.9min speed=125K/sec
> 
> unused devices: <none>
> 
> Googling around, I get the impression that resetting the badblocks
> list is (a) not supported by the mdadm command; and (b) considered
> harmful. However, if the blocks aren't really bad any more, as they
> are now readable, does that risk still hold? How can I get this
> reshape to proceed?
> 
> Updated mdadm --examine output is at
> https://app.box.com/v/raid-status-2017-02-20

Add Neil and Jes.

Yes, there were similar reports before. When reshape finds badblocks, it
loops forever without making any progress. I think there are two
things we need to do:

- Make reshape more robust. Maybe reshape should bail out if badblocks are found.
- Add an option in mdadm to force a reset of badblocks

Thanks,
Shaohua


* Re: Device size for linux raid5 journal?
From: Shaohua Li @ 2017-02-21 18:03 UTC
  To: Christian Samsel; +Cc: linux-raid
In-Reply-To: <trinity-e6976ea9-ab41-4a2a-85ce-50b77ded8f35-1487609409044@3capp-gmx-bs16>

On Mon, Feb 20, 2017 at 05:50:09PM +0100, Christian Samsel wrote:
> Hello raid team,
> First of all, thanks for your work.
> So I recently read about Linux raid5/raid6 write-cache and journaling and thought about
> giving it a try. I'm mainly interested in the additional safety provided by the journal, but
> to be future-proof I might want to use the write cache as well.
> I read how to create an array using a journal, but I haven't found the slightest indication of how
> large the respective device/partition should be. I went through the lwn article [1], the slides [2] of the
> respective engineers at facebook and a few commit messages.
> 
> So my question is, let's assume I have a 6TB raid5 array (3x3TB), what would a good journal device size be?
> I'd probably go with 4GB, as this is roughly the upper bound of what hardware raid controllers offer.

Thanks for trying! It depends on whether you use write-through or write-back
mode. For write-through mode, the size could be just several hundred
megabytes. For write-back mode, the size should be a little bigger, several
gigabytes, but 4 GB should be enough.

Also, I recently added documentation about raid5-cache in
kernel_source/Documentation/md/raid5-cache.txt.

Thanks,
Shaohua


* Re: Trouble reassembling RAID10
From: Roger Roglans @ 2017-02-21 18:38 UTC
  To: Phil Turmel; +Cc: linux-raid
In-Reply-To: <f01c3494-1191-fdf5-c32e-c60abcdc989e@turmel.org>

Hi Phil,

seems very useful to know in the future. I ended up just assuming
clean and using "--create". Since I was able to discern the exact
configurations, I was able to mount it and am currently transferring
data. I know it was not the ideal solution but I believe that it
worked out with only minimal corruption. I might have problems with
another array soon. If so, I will certainly contact this mailing list
again.

thanks for your help,

Roger

On Feb 21, 2017 9:16 AM, "Phil Turmel" <philip@turmel.org> wrote:

Hi Roger,

On 02/20/2017 04:42 PM, Roger Roglans wrote:
> Hey new to the mailing list and fairly new to RAIDs in general. I
> ran into an issue and was hoping someone could help.

We probably can.

> Our server that runs a 14 drive RAID10 through a rocketraid 2470
> controller refused to assemble. Our goal is not necessarily to
> recover a working RAID, but to get as much data back as possible.

Amounts to the same thing.

> Maybe as a consequence of the assembly failure, upon shutting down
> the server, it would get stuck in boot loops. So I'm currently
> running Ubuntu 16.04.1 from a USB. I've determined that 2 of 14
> disks are faulty and have determined which ones they are.

Three.  Two have been faulty for a very long time.  No-one noticed
the degraded status.

> Here is the output of a mdadm --examine call.

Please re-do this, combined with smartctl, and without grep.  This
will tell us everything about your array.  Like so:

for x in /dev/sd[a-p] ; do mdadm -E ${x}1 ; smartctl -iA -l scterc $x ; done

Paste the output *inline* in your plain-text reply with line
wrapping disabled.  If your draft email is larger than 100k, split
into multiple emails.

You are likely to need an alternate bootable USB stick -- your
report sounds like one of the versions of mdadm that had a bug
in forced assembly.  I usually use the latest one from
https://www.system-rescue-cd.org/

Please also read the recent thread and its references starting
here: https://marc.info/?l=linux-raid&m=148755536616025&w=2

Phil


* [PATCH] md/raid1: handle flush request correctly
From: Shaohua Li @ 2017-02-21 19:03 UTC
  To: linux-raid; +Cc: Coly Li, NeilBrown

I got a warning triggered in align_to_barrier_unit_end. It's a flush
request so sectors == 0. The flush request happens to work well without
the new barrier patch, but we'd better handle it explicitly.

Cc: Coly Li <colyli@suse.de>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 954d028..e1ee446 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1282,8 +1282,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 	unsigned long flags;
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	const unsigned long do_flush_fua = (bio->bi_opf &
-						(REQ_PREFLUSH | REQ_FUA));
+	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
 	struct md_rdev *blocked_rdev;
 	struct blk_plug_cb *cb;
 	struct raid1_plug_cb *plug = NULL;
@@ -1509,7 +1508,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 				   conf->mirrors[i].rdev->data_offset);
 		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
 		mbio->bi_end_io	= raid1_end_write_request;
-		bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
+		bio_set_op_attrs(mbio, op, do_fua | do_sync);
 		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
 		    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
 		    conf->raid_disks - mddev->degraded > 1)
@@ -1565,6 +1564,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	struct bio *split;
 	sector_t sectors;
 
+	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
+		md_flush_request(mddev, bio);
+		return;
+	}
+
 	/* if bio exceeds barrier unit boundary, split it */
 	sectors = align_to_barrier_unit_end(
 			bio->bi_iter.bi_sector, bio_sectors(bio));
-- 
2.9.3
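
To see why a zero-length flush is a special case for the barrier unit
arithmetic, the following userspace sketch may help. It is not the kernel's
align_to_barrier_unit_end(). It assumes the 64MB barrier unit mentioned in
the related thread (1<<17 sectors of 512 bytes), and the sketch_* and
SKETCH_* names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNIT_SECTOR_BITS	17
#define SKETCH_UNIT_SECTORS	(1ULL << SKETCH_UNIT_SECTOR_BITS)

typedef uint64_t sketch_sector_t;

/*
 * Number of sectors from `start` to the end of the barrier unit that
 * contains it, capped at `sectors`. A bio that crosses a unit boundary
 * gets a result smaller than `sectors` and has to be split there.
 */
sketch_sector_t sketch_sectors_in_unit(sketch_sector_t start,
				       sketch_sector_t sectors)
{
	sketch_sector_t unit_end =
		(start / SKETCH_UNIT_SECTORS + 1) * SKETCH_UNIT_SECTORS;
	sketch_sector_t len = unit_end - start;

	return len > sectors ? sectors : len;
}
```

For an empty flush, `sectors` is 0, so the result is 0 as well and there is
no data range to align or split. Routing such a bio to md_flush_request()
before this arithmetic runs, as the patch does, avoids relying on that
degenerate result.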



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Wols Lists @ 2017-02-21 19:20 UTC
  To: Coly Li
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <e8ff590a-3972-465a-bf05-36f3e8542f0a@suse.de>

On 21/02/17 11:30, Coly Li wrote:
> On 2017/2/21 2:14 AM, Wols Lists wrote:
>> On 20/02/17 08:07, Coly Li wrote:
>>> For the function pointer assignment, it is because I see a branch
>>> inside a loop. If I use a function pointer, I can avoid a redundant
>>> branch inside the loop. raid1_read_request() and
>>> raid1_write_request() are not simple functions, and I don't know
>>> whether gcc will inline them or not, so I am checking the
>>> disassembled code.
>>
>> Can you force gcc to inline or compile a function? Isn't it dangerous to
>> rely on default behaviour and assume it won't change when the compiler
>> is upgraded?
> 
> I choose to trust the compiler, and the people behind gcc.
> 
I admire your faith. I seem to remember several occasions where the gcc
people added new optimisations and caused all sorts of subtle havoc with
the kernel where it relied on the old behaviour. Don't forget - the
linux kernel is one of the compiler's most demanding customers. And
don't forget also - there are quite a few people now using llvm to
compile the kernel (it may not be fully working yet, though I think it
certainly is for simple use cases) so tests on gcc don't guarantee it'll
work for everyone ...

I think you can trace the addition of many kernel compile-time flags to
that sort of thing - disabling new optimisations.

Cheers,
Wol
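
On the earlier question of whether inlining can be forced: gcc and clang
both accept function attributes for this, and the kernel wraps them in its
__always_inline and noinline macros. The sketch below only illustrates the
attributes. The sketch_* functions are hypothetical stand-ins, not the
raid1 request handlers.

```c
#include <assert.h>

/* Spell the attributes out directly so the example builds in userspace. */
#define SKETCH_ALWAYS_INLINE	inline __attribute__((always_inline))
#define SKETCH_NOINLINE		__attribute__((noinline))

/* Inlined into every caller, regardless of the optimiser's heuristics. */
static SKETCH_ALWAYS_INLINE int sketch_read_request(int sectors)
{
	return sectors;
}

/* Always kept as a separate out-of-line function. */
static SKETCH_NOINLINE int sketch_write_request(int sectors)
{
	return -sectors;
}

/* Dispatch on direction with a plain branch instead of a function pointer. */
int sketch_make_request(int is_read, int sectors)
{
	assert(sectors >= 0);
	if (is_read)
		return sketch_read_request(sectors);
	return sketch_write_request(sectors);
}
```

With attributes like these the inlining decision is stated in the source
rather than left to compiler defaults, which is one way to address the
concern about behaviour changing across compiler upgrades.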



* Re: Trouble reassembling RAID10
From: Wols Lists @ 2017-02-21 19:33 UTC
  To: Roger Roglans, Phil Turmel; +Cc: linux-raid
In-Reply-To: <CAPXQETk=0j+2P-dex4GpPTmEyBpkUcqHt4WSKJnuNGZvuwY97g@mail.gmail.com>

On 21/02/17 18:38, Roger Roglans wrote:
> Hi Phil,
> 
> seems very useful to know in the future. I ended up just assuming
> clean and using "--create". Since I was able to discern the exact
> configurations, I was able to mount it and am currently transferring
> data. I know it was not the ideal solution but I believe that it
> worked out with only minimal corruption. I might have problems with
> another array soon. If so, I will certainly contact this mailing list
> again.

If I can plug my own work :-) there is now a section on the linux wiki
about troubleshooting an array, and what data to gather for the list.
Look at

https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn

That should be enough to fix simple problems or, for more serious ones,
you'll have the bulk if not all the information the members of the list
will need, saving the back-and-forth of "can we have this, can we have
that".

If you find any problems with the information on the wiki, let me know
and I'll endeavour to fix it.

Cheers,
Wol


* [PATCH v4 0/7] Partial Parity Log for MD RAID 5
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

This series of patches implements the Partial Parity Log for RAID5 arrays. The
purpose of this feature is closing the RAID 5 Write Hole. It is an alternative
solution to the existing raid5-cache, but the logging workflow and much of
the implementation are based on it.

The main difference compared to raid5-cache is that PPL is a distributed log -
it is stored on array member drives in the metadata area and does not require a
dedicated journaling drive. Write performance is reduced by up to 30%-40% but
it scales with the number of drives in the array and the journaling drive does
not become a bottleneck or a single point of failure. PPL does not protect from
losing in-flight data, only from silent data corruption. More details about how
the log works can be found in patches 3 and 5.

This feature originated from Intel RSTe, which uses IMSM metadata. This
patchset implements PPL for external metadata (specifically IMSM) as well as
native MD v1.x metadata.

Changes in mdadm are also required to make this fully usable. Patches for mdadm
will be sent later.

v4:
- Separated raid5-cache and ppl structures.
- Removed the policy logic from raid5-cache, ppl calls moved to raid5 core.
- Checking wrong configuration when validating superblock.
- Moved documentation to separate file.
- More checks for ppl sector/size.
- Some small fixes and improvements.

v3:
- Fixed alignment issues in the metadata structures.
- Removed reading IMSM signature from superblock.
- Removed 'rwh_policy' and per-device JournalPpl flags, added
  'consistency_policy', 'ppl_sector' and 'ppl_size' sysfs attributes.
- Reworked and simplified disk removal logic.
- Debug messages in raid5-ppl.c converted to pr_debug().
- Fixed some bugs in logging and recovery code.
- Improved descriptions and documentation.

v2:
- Fixed wrong PPL size calculation for IMSM.
- Simplified full stripe write case.
- Removed direct access to bi_io_vec.
- Handle failed bio_add_page().

Artur Paszkiewicz (7):
  md: superblock changes for PPL
  raid5: calculate partial parity for a stripe
  raid5-ppl: Partial Parity Log write logging implementation
  md: add sysfs entries for PPL
  raid5-ppl: load and recover the log
  raid5-ppl: support disk hot add/remove with PPL
  raid5-ppl: runtime PPL enabling or disabling

 Documentation/admin-guide/md.rst |   32 +-
 Documentation/md/raid5-ppl.txt   |   44 ++
 drivers/md/Makefile              |    2 +-
 drivers/md/md.c                  |  140 +++++
 drivers/md/md.h                  |   10 +
 drivers/md/raid0.c               |    3 +-
 drivers/md/raid1.c               |    3 +-
 drivers/md/raid5-ppl.c           | 1151 ++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c               |  227 +++++++-
 drivers/md/raid5.h               |   19 +
 include/uapi/linux/raid/md_p.h   |   44 +-
 11 files changed, 1657 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/md/raid5-ppl.txt
 create mode 100644 drivers/md/raid5-ppl.c

-- 
2.11.0


* [PATCH v4 1/7] md: superblock changes for PPL
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Include information about PPL location and size in mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogously to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c                | 19 +++++++++++++++++++
 drivers/md/md.h                |  8 ++++++++
 drivers/md/raid0.c             |  3 ++-
 drivers/md/raid1.c             |  3 ++-
 include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
 5 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 55e7e7a8714e..c2028007b209 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1514,6 +1514,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	} else if (sb->bblog_offset != 0)
 		rdev->badblocks.shift = 0;
 
+	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
+		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
+		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
+	}
+
 	if (!refdev) {
 		ret = 1;
 	} else {
@@ -1626,6 +1632,13 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
+
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+			if (le32_to_cpu(sb->feature_map) &
+			    (MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL))
+				return -EINVAL;
+			set_bit(MD_HAS_PPL, &mddev->flags);
+		}
 	} else if (mddev->pers == NULL) {
 		/* Insist of good event counter while assembling, except for
 		 * spares (which don't need an event count) */
@@ -1839,6 +1852,12 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
 		sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
 
+	if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
+		sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
+		sb->ppl.size = cpu_to_le16(rdev->ppl.size);
+	}
+
 	rdev_for_each(rdev2, mddev) {
 		i = rdev2->desc_nr;
 		if (test_bit(Faulty, &rdev2->flags))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8859cbf84b6..66a5a16e79f7 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -122,6 +122,13 @@ struct md_rdev {
 					   * sysfs entry */
 
 	struct badblocks badblocks;
+
+	struct {
+		short offset;	/* Offset from superblock to start of PPL.
+				 * Not used by external metadata. */
+		unsigned int size;	/* Size in sectors of the PPL space */
+		sector_t sector;	/* First sector of the PPL space */
+	} ppl;
 };
 enum flag_bits {
 	Faulty,			/* device is known to have a fault */
@@ -229,6 +236,7 @@ enum mddev_flags {
 				 * supported as calls to md_error() will
 				 * never cause the array to become failed.
 				 */
+	MD_HAS_PPL,		/* The raid array has PPL feature set */
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index b3d264452fd5..1b6aa9a53149 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -29,7 +29,8 @@
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
-	 (1L << MD_FAILFAST_SUPPORTED))
+	 (1L << MD_FAILFAST_SUPPORTED) |\
+	 (1L << MD_HAS_PPL))
 
 static int raid0_congested(struct mddev *mddev, int bits)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2e5e4805cbe1..a95e08501619 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -44,7 +44,8 @@
 
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
-	 (1L << MD_JOURNAL_CLEAN))
+	 (1L << MD_JOURNAL_CLEAN) |	\
+	 (1L << MD_HAS_PPL))
 
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 9930f3e9040f..fe2112810c43 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -242,10 +242,18 @@ struct mdp_superblock_1 {
 
 	__le32	chunksize;	/* in 512byte sectors */
 	__le32	raid_disks;
-	__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful of feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__le16 offset; /* sectors from start of superblock that ppl starts (signed) */
+			__le16 size; /* ppl size in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__le32	new_level;	/* new level we are reshaping to		*/
@@ -318,6 +326,7 @@ struct mdp_superblock_1 {
 					     */
 #define MD_FEATURE_CLUSTERED		256 /* clustered MD */
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -328,6 +337,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_RECOVERY_BITMAP	\
 					|MD_FEATURE_CLUSTERED		\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 struct r5l_payload_header {
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 2/7] raid5: calculate partial parity for a stripe
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.

Partial parity is the XOR of the unmodified data chunks of a stripe and is
calculated as follows:

- reconstruct-write case:
  xor data from all not updated disks in a stripe

- read-modify-write case:
  xor old data and parity from all updated disks in a stripe
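
The two cases above give the same result, which can be checked with a small
userspace sketch. This is only an illustration under simplified assumptions:
the chunk size, the flat buffer layout and the function names are made up
here, and the real code works on pages through the async_tx API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8	/* illustrative chunk size in bytes, not a kernel value */

static void xor_into(uint8_t *dst, const uint8_t *src)
{
	for (size_t i = 0; i < CHUNK; i++)
		dst[i] ^= src[i];
}

/*
 * reconstruct-write: XOR the data of all disks that are NOT updated.
 * 'data' holds ndata chunks back to back; updated[i] != 0 marks a disk
 * that the write request modifies.
 */
static void partial_parity_rcw(uint8_t *pp, const uint8_t *data,
			       const int *updated, int ndata)
{
	memset(pp, 0, CHUNK);
	for (int i = 0; i < ndata; i++)
		if (!updated[i])
			xor_into(pp, data + (size_t)i * CHUNK);
}

/*
 * read-modify-write: XOR the OLD data of all updated disks with the OLD
 * parity. The old contents of the updated disks cancel out of the parity,
 * leaving the XOR of the unmodified chunks.
 */
static void partial_parity_rmw(uint8_t *pp, const uint8_t *old_data,
			       const uint8_t *old_parity,
			       const int *updated, int ndata)
{
	memcpy(pp, old_parity, CHUNK);
	for (int i = 0; i < ndata; i++)
		if (updated[i])
			xor_into(pp, old_data + (size_t)i * CHUNK);
}
```

Which variant is used depends only on which old blocks are already available
in the stripe cache; the logged value is the same either way.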

Implement it using the async_tx API and integrate into raid_run_ops().
It must be called when we still have access to old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored into sh->ppl_page.

Partial parity is not meaningful for a full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when the
stripe has STRIPE_FULL_WRITE set.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.h |   3 ++
 2 files changed, 103 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b7722bb2e8d..02e02fe5b04e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -463,6 +463,11 @@ static void shrink_buffers(struct stripe_head *sh)
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
+
+	if (sh->ppl_page) {
+		put_page(sh->ppl_page);
+		sh->ppl_page = NULL;
+	}
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -479,6 +484,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 		sh->dev[i].page = page;
 		sh->dev[i].orig_page = page;
 	}
+
+	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
+		sh->ppl_page = alloc_page(gfp);
+		if (!sh->ppl_page)
+			return 1;
+	}
+
 	return 0;
 }
 
@@ -1974,6 +1986,55 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }
 
+static struct dma_async_tx_descriptor *
+ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
+		       struct dma_async_tx_descriptor *tx)
+{
+	int disks = sh->disks;
+	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
+	int count = 0, pd_idx = sh->pd_idx, i;
+	struct async_submit_ctl submit;
+
+	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
+
+	/*
+	 * Partial parity is the XOR of stripe data chunks that are not changed
+	 * during the write request. Depending on available data
+	 * (read-modify-write vs. reconstruct-write case) we calculate it
+	 * differently.
+	 */
+	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+		/* rmw: xor old data and parity from updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
+				xor_srcs[count++] = dev->page;
+		}
+	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
+		/* rcw: xor data from all not updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_UPTODATE, &dev->flags))
+				xor_srcs[count++] = dev->page;
+		}
+	} else {
+		return tx;
+	}
+
+	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
+			  flex_array_get(percpu->scribble, 0)
+			  + sizeof(struct page *) * (sh->disks + 2));
+
+	if (count == 1)
+		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
+				  &submit);
+	else
+		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
+			       &submit);
+
+	return tx;
+}
+
 static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 {
 	int overlap_clear = 0, i, disks = sh->disks;
@@ -2004,6 +2065,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 			async_tx_ack(tx);
 	}
 
+	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
+		tx = ops_run_partial_parity(sh, percpu, tx);
+
 	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
 		if (level < 6)
 			tx = ops_run_prexor5(sh, percpu, tx);
@@ -3079,6 +3143,12 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked++;
 	}
 
+	if (level == 5 && sh->ppl_page &&
+	    test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
+	    !test_bit(STRIPE_FULL_WRITE, &sh->state) &&
+	    test_bit(R5_Insync, &sh->dev[pd_idx].flags))
+		set_bit(STRIPE_OP_PARTIAL_PARITY, &s->ops_request);
+
 	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
 		__func__, (unsigned long long)sh->sector,
 		s->locked, s->ops_request);
@@ -3126,6 +3196,36 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
 
+	if (forwrite && sh->ppl_page) {
+		/*
+		 * With PPL only writes to consecutive data chunks within a
+		 * stripe are allowed because for a single stripe_head we can
+		 * only have one PPL entry at a time, which describes one data
+		 * range. Not really an overlap, but wait_for_overlap can be
+		 * used to handle this.
+		 */
+		sector_t sector;
+		sector_t first = 0;
+		sector_t last = 0;
+		int count = 0;
+		int i;
+
+		for (i = 0; i < sh->disks; i++) {
+			if (i != sh->pd_idx &&
+			    (i == dd_idx || sh->dev[i].towrite)) {
+				sector = sh->dev[i].sector;
+				if (count == 0 || sector < first)
+					first = sector;
+				if (sector > last)
+					last = sector;
+				count++;
+			}
+		}
+
+		if (first + conf->chunk_sectors * (count - 1) != last)
+			goto overlap;
+	}
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4bb27b97bf6b..3cc4cb28f7e6 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -228,6 +228,8 @@ struct stripe_head {
 	struct list_head	log_list;
 	sector_t		log_start; /* first meta block on the journal */
 	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
+
+	struct page		*ppl_page; /* partial parity of this stripe */
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -400,6 +402,7 @@ enum {
 	STRIPE_OP_BIODRAIN,
 	STRIPE_OP_RECONSTRUCT,
 	STRIPE_OP_CHECK,
+	STRIPE_OP_PARTIAL_PARITY,
 };
 
 /*
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

This implements the PPL write logging functionality. The description
of PPL is added to the documentation. More details can be found in the
comments in raid5-ppl.c.
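
As a rough illustration of the sizing rule that ppl_validate_rdev() applies
in this patch, the usable PPL size can be sketched in userspace C. The
constant and function names are hypothetical. The sketch assumes a 4KB header
and 4KB pages (8 sectors of 512 bytes each), and returns -1 where the kernel
code returns -ENOSPC.

```c
#define HDR_SECTORS	8	/* 4KB PPL header, in 512-byte sectors */
#define STRIPE_SECS	8	/* one 4KB page per stripe, in sectors (assumed) */

/*
 * Given the configured PPL size in sectors, return the size actually used:
 * the header plus the partial parity space rounded down to a whole number
 * of stripes. Returns -1 if there is no room for even one stripe.
 */
static int ppl_usable_sectors(int ppl_size)
{
	int data = ppl_size - HDR_SECTORS;

	if (data > 0)
		data -= data % STRIPE_SECS;	/* round down to stripe size */
	if (data <= 0)
		return -1;

	return data + HDR_SECTORS;
}
```

With these assumptions, a 4KB header plus 128KB of partial parity space
corresponds to 264 sectors, and a configured size of 23 sectors is trimmed
to 16.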

Put the PPL metadata structures in md_p.h because userspace tools
(mdadm) will also need to read/write PPL.

For now, warn about using PPL when the volatile write-back cache is
enabled on a member disk. The warning can be removed once disk cache
flushing before writing PPL is implemented.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/md/raid5-ppl.txt |  44 +++
 drivers/md/Makefile            |   2 +-
 drivers/md/raid5-ppl.c         | 617 +++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c             |  49 +++-
 drivers/md/raid5.h             |   8 +
 include/uapi/linux/raid/md_p.h |  26 ++
 6 files changed, 738 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/md/raid5-ppl.txt
 create mode 100644 drivers/md/raid5-ppl.c

diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
new file mode 100644
index 000000000000..127072b09363
--- /dev/null
+++ b/Documentation/md/raid5-ppl.txt
@@ -0,0 +1,44 @@
+Partial Parity Log
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such a condition is known as the RAID5 Write Hole. Because of
+this, md by default does not allow starting a dirty degraded array.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the not modified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown and all disks are available, eliminating the need to resync
+the array. Because of this, using write-intent bitmap and PPL together is not
+supported.
+
+When handling a write request PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe.  It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such a
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
+
+Currently, volatile write-back cache should be disabled on all member drives
+when using PPL. Otherwise it cannot guarantee consistency in case of power
+failure.
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1af87a0..4d48714ccc6b 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o raid5-cache.o
+raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
new file mode 100644
index 000000000000..a00cabf1adf6
--- /dev/null
+++ b/drivers/md/raid5-ppl.c
@@ -0,0 +1,617 @@
+/*
+ * Partial Parity Log for closing the RAID5 write hole
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/crc32c.h>
+#include <linux/raid/md_p.h>
+#include "md.h"
+#include "raid5.h"
+
+/*
+ * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
+ * partial parity data. The header contains an array of entries
+ * (struct ppl_header_entry) which describe the logged write requests.
+ * Partial parity for the entries comes after the header, written in the same
+ * sequence as the entries:
+ *
+ * Header
+ *   entry0
+ *   ...
+ *   entryN
+ * PP data
+ *   PP for entry0
+ *   ...
+ *   PP for entryN
+ *
+ * Every entry holds a checksum of its partial parity, the header also has a
+ * checksum of the header itself. Entries for full stripe writes contain no
+ * partial parity, they only mark the stripes for which parity should be
+ * recalculated after an unclean shutdown.
+ *
+ * A write request is always logged to the PPL instance stored on the parity
+ * disk of the corresponding stripe. For each member disk there is one ppl_log
+ * used to handle logging for this disk, independently from others. They are
+ * grouped in child_logs array in struct ppl_conf, which is assigned to
+ * r5conf->ppl and used in raid5 core.
+ *
+ * ppl_io_unit represents a full PPL write, header_page contains the ppl_header.
+ * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
+ * be appended to the last entry if the chunks to write are the same, otherwise
+ * a new entry is added. Checksums of entries are calculated incrementally as
+ * stripes containing partial parity are being added to entries.
+ * ppl_submit_iounit() calculates the checksum of the header and submits a bio
+ * containing the header page and partial parity pages (sh->ppl_page) for all
+ * stripes of the io_unit. When the PPL write completes, the stripes associated
+ * with the io_unit are released and raid5d starts writing their data and
+ * parity. When all stripes are written, the io_unit is freed and the next can
+ * be submitted.
+ *
+ * An io_unit is used to gather stripes until it is submitted or becomes full
+ * (if the maximum number of entries or size of PPL is reached). Another io_unit
+ * can't be submitted until the previous has completed (PPL and stripe
+ * data+parity is written). The log->io_list tracks all io_units of a log
+ * (for a single member disk). New io_units are added to the end of the list
+ * and the first io_unit is submitted, if it is not submitted already.
+ * The current io_unit accepting new stripes is always at the end of the list.
+ */
+
+struct ppl_conf {
+	struct mddev *mddev;
+
+	/* array of child logs, one for each raid disk */
+	struct ppl_log *child_logs;
+	int count;
+
+	int block_size;		/* the logical block size used for data_sector
+				 * in ppl_header_entry */
+	u32 signature;		/* raid array identifier */
+	atomic64_t seq;		/* current log write sequence number */
+
+	struct kmem_cache *io_kc;
+	mempool_t *io_pool;
+	struct bio_set *bs;
+	mempool_t *meta_pool;
+};
+
+struct ppl_log {
+	struct ppl_conf *ppl_conf;	/* shared between all log instances */
+
+	struct md_rdev *rdev;		/* array member disk associated with
+					 * this log instance */
+	struct mutex io_mutex;
+	struct ppl_io_unit *current_io;	/* current io_unit accepting new data
+					 * always at the end of io_list */
+	spinlock_t io_list_lock;
+	struct list_head io_list;	/* all io_units of this log */
+	struct list_head no_mem_stripes;/* stripes to retry if failed to
+					 * allocate io_unit */
+};
+
+struct ppl_io_unit {
+	struct ppl_log *log;
+
+	struct page *header_page;	/* for ppl_header */
+
+	unsigned int entries_count;	/* number of entries in ppl_header */
+	unsigned int pp_size;		/* total current size of partial parity */
+
+	u64 seq;			/* sequence number of this log write */
+	struct list_head log_sibling;	/* log->io_list */
+
+	struct list_head stripe_list;	/* stripes added to the io_unit */
+	atomic_t pending_stripes;	/* how many stripes not written to raid */
+
+	bool submitted;			/* true if write to log started */
+};
+
+static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
+					  struct stripe_head *sh)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct ppl_io_unit *io;
+	struct ppl_header *pplhdr;
+
+	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
+	if (!io)
+		return NULL;
+
+	memset(io, 0, sizeof(*io));
+	io->log = log;
+	INIT_LIST_HEAD(&io->log_sibling);
+	INIT_LIST_HEAD(&io->stripe_list);
+	atomic_set(&io->pending_stripes, 0);
+
+	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
+	pplhdr = page_address(io->header_page);
+	clear_page(pplhdr);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
+
+	io->seq = atomic64_add_return(1, &ppl_conf->seq);
+	pplhdr->generation = cpu_to_le64(io->seq);
+
+	return io;
+}
+
+static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
+{
+	struct ppl_io_unit *io = log->current_io;
+	struct ppl_header_entry *e = NULL;
+	struct ppl_header *pplhdr;
+	int i;
+	sector_t data_sector = 0;
+	int data_disks = 0;
+	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
+	struct r5conf *conf = sh->raid_conf;
+
+	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
+
+	/* check if current io_unit is full */
+	if (io && (io->pp_size == entry_space ||
+		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
+		pr_debug("%s: add io_unit blocked by seq: %llu\n",
+			 __func__, io->seq);
+		io = NULL;
+	}
+
+	/* add a new unit if there is none or the current is full */
+	if (!io) {
+		io = ppl_new_iounit(log, sh);
+		if (!io)
+			return -ENOMEM;
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&io->log_sibling, &log->io_list);
+		spin_unlock_irq(&log->io_list_lock);
+
+		log->current_io = io;
+	}
+
+	for (i = 0; i < sh->disks; i++) {
+		struct r5dev *dev = &sh->dev[i];
+
+		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
+			if (!data_disks || dev->sector < data_sector)
+				data_sector = dev->sector;
+			data_disks++;
+		}
+	}
+	BUG_ON(!data_disks);
+
+	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
+		 io->seq, (unsigned long long)data_sector, data_disks);
+
+	pplhdr = page_address(io->header_page);
+
+	if (io->entries_count > 0) {
+		struct ppl_header_entry *last =
+				&pplhdr->entries[io->entries_count - 1];
+		u64 data_sector_last = le64_to_cpu(last->data_sector);
+		u32 data_size_last = le32_to_cpu(last->data_size);
+		u32 pp_size_last = le32_to_cpu(last->pp_size);
+
+		/*
+		 * Check if we can merge with the last entry. Must be on
+		 * the same stripe and disks. Use bit shift and logarithm
+		 * to avoid 64-bit division.
+		 */
+		if ((data_sector >> ilog2(conf->chunk_sectors) ==
+		     data_sector_last >> ilog2(conf->chunk_sectors)) &&
+		    ((pp_size_last == 0 &&
+		      test_bit(STRIPE_FULL_WRITE, &sh->state)) ||
+		     ((data_sector_last + (pp_size_last >> 9) == data_sector) &&
+		      (data_size_last == pp_size_last * data_disks))))
+			e = last;
+	}
+
+	if (!e) {
+		e = &pplhdr->entries[io->entries_count++];
+		e->data_sector = cpu_to_le64(data_sector);
+		e->parity_disk = cpu_to_le32(sh->pd_idx);
+		e->checksum = cpu_to_le32(~0);
+	}
+
+	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
+
+	/* don't write any PP if full stripe write */
+	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
+		le32_add_cpu(&e->pp_size, PAGE_SIZE);
+		io->pp_size += PAGE_SIZE;
+		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
+						    page_address(sh->ppl_page),
+						    PAGE_SIZE));
+	}
+
+	list_add_tail(&sh->log_list, &io->stripe_list);
+	atomic_inc(&io->pending_stripes);
+	sh->ppl_log_io = io;
+
+	return 0;
+}
+
+int ppl_write_stripe(struct ppl_conf *ppl_conf, struct stripe_head *sh)
+{
+	struct ppl_log *log;
+	struct ppl_io_unit *io = sh->ppl_log_io;
+
+	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
+	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
+	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
+		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
+		return -EAGAIN;
+	}
+
+	log = &ppl_conf->child_logs[sh->pd_idx];
+
+	mutex_lock(&log->io_mutex);
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		mutex_unlock(&log->io_mutex);
+		return -EAGAIN;
+	}
+
+	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	atomic_inc(&sh->count);
+
+	if (ppl_log_stripe(log, sh)) {
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&sh->log_list, &log->no_mem_stripes);
+		spin_unlock_irq(&log->io_list_lock);
+	}
+
+	mutex_unlock(&log->io_mutex);
+
+	return 0;
+}
+
+static void ppl_log_endio(struct bio *bio)
+{
+	struct ppl_io_unit *io = bio->bi_private;
+	struct ppl_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct stripe_head *sh, *next;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	if (bio->bi_error)
+		md_error(ppl_conf->mddev, log->rdev);
+
+	bio_put(bio);
+	mempool_free(io->header_page, ppl_conf->meta_pool);
+
+	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
+		list_del_init(&sh->log_list);
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+}
+
+static void ppl_submit_iounit(struct ppl_io_unit *io)
+{
+	struct ppl_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct r5conf *conf = ppl_conf->mddev->private;
+	struct ppl_header *pplhdr = page_address(io->header_page);
+	struct bio *bio;
+	struct stripe_head *sh;
+	int i;
+	struct bio_list bios = BIO_EMPTY_LIST;
+	char b[BDEVNAME_SIZE];
+
+	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
+	bio->bi_private = io;
+	bio->bi_end_io = ppl_log_endio;
+	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
+	bio->bi_bdev = log->rdev->bdev;
+	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
+	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
+	bio_list_add(&bios, bio);
+
+	sh = list_first_entry(&io->stripe_list, struct stripe_head, log_list);
+
+	for (i = 0; i < io->entries_count; i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+		u32 pp_size = le32_to_cpu(e->pp_size);
+		u32 data_size = le32_to_cpu(e->data_size);
+		u64 data_sector = le64_to_cpu(e->data_sector);
+		int stripes_count;
+
+		if (pp_size > 0)
+			stripes_count = pp_size >> PAGE_SHIFT;
+		else
+			stripes_count = (data_size /
+					 (conf->raid_disks -
+					  conf->max_degraded)) >> PAGE_SHIFT;
+
+		while (stripes_count--) {
+			/*
+			 * if entry without partial parity just skip its stripes
+			 * without adding pages to bio
+			 */
+			if (pp_size > 0 &&
+			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
+				struct bio *prev = bio;
+
+				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
+						       ppl_conf->bs);
+				bio->bi_opf = prev->bi_opf;
+				bio->bi_bdev = prev->bi_bdev;
+				bio->bi_iter.bi_sector = bio_end_sector(prev);
+				bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
+				bio_chain(bio, prev);
+				bio_list_add(&bios, bio);
+			}
+			sh = list_next_entry(sh, log_list);
+		}
+
+		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
+			 __func__, io->seq, i, data_sector, pp_size, data_size);
+
+		e->data_sector = cpu_to_le64(data_sector >>
+					     ilog2(ppl_conf->block_size >> 9));
+		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
+	}
+
+	pplhdr->entries_count = cpu_to_le32(io->entries_count);
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
+
+	while ((bio = bio_list_pop(&bios))) {
+		pr_debug("%s: seq: %llu submit_bio() size: %u sector: %llu dev: %s\n",
+			 __func__, io->seq, bio->bi_iter.bi_size,
+			 (unsigned long long)bio->bi_iter.bi_sector,
+			 bdevname(bio->bi_bdev, b));
+		submit_bio(bio);
+	}
+}
+
+static void ppl_submit_current_io(struct ppl_log *log)
+{
+	struct ppl_io_unit *io;
+
+	spin_lock_irq(&log->io_list_lock);
+
+	io = list_first_entry_or_null(&log->io_list, struct ppl_io_unit,
+				      log_sibling);
+	if (io && io->submitted)
+		io = NULL;
+
+	spin_unlock_irq(&log->io_list_lock);
+
+	if (io) {
+		io->submitted = true;
+
+		if (io == log->current_io)
+			log->current_io = NULL;
+
+		ppl_submit_iounit(io);
+	}
+}
+
+void ppl_write_stripe_run(struct ppl_conf *ppl_conf)
+{
+	struct ppl_log *log;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		log = &ppl_conf->child_logs[i];
+
+		mutex_lock(&log->io_mutex);
+		ppl_submit_current_io(log);
+		mutex_unlock(&log->io_mutex);
+	}
+}
+
+static void ppl_io_unit_finished(struct ppl_io_unit *io)
+{
+	struct ppl_log *log = io->log;
+	unsigned long flags;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+
+	list_del(&io->log_sibling);
+	mempool_free(io, log->ppl_conf->io_pool);
+
+	if (!list_empty(&log->no_mem_stripes)) {
+		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
+							  struct stripe_head,
+							  log_list);
+		list_del_init(&sh->log_list);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+}
+
+void ppl_stripe_write_finished(struct stripe_head *sh)
+{
+	struct ppl_io_unit *io;
+
+	io = sh->ppl_log_io;
+	sh->ppl_log_io = NULL;
+
+	if (io && atomic_dec_and_test(&io->pending_stripes))
+		ppl_io_unit_finished(io);
+}
+
+void ppl_exit_log(struct ppl_conf *ppl_conf)
+{
+	kfree(ppl_conf->child_logs);
+
+	mempool_destroy(ppl_conf->meta_pool);
+	if (ppl_conf->bs)
+		bioset_free(ppl_conf->bs);
+	mempool_destroy(ppl_conf->io_pool);
+	kmem_cache_destroy(ppl_conf->io_kc);
+
+	kfree(ppl_conf);
+}
+
+static int ppl_validate_rdev(struct md_rdev *rdev)
+{
+	char b[BDEVNAME_SIZE];
+	int ppl_data_sectors;
+	int ppl_size_new;
+
+	/*
+	 * The configured PPL size must be enough to store
+	 * the header and (at the very least) partial parity
+	 * for one stripe. Round it down to ensure the data
+	 * space is cleanly divisible by stripe size.
+	 */
+	ppl_data_sectors = rdev->ppl.size - (PPL_HEADER_SIZE >> 9);
+
+	if (ppl_data_sectors > 0)
+		ppl_data_sectors = rounddown(ppl_data_sectors, STRIPE_SECTORS);
+
+	if (ppl_data_sectors <= 0) {
+		pr_warn("md/raid:%s: PPL space too small on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -ENOSPC;
+	}
+
+	ppl_size_new = ppl_data_sectors + (PPL_HEADER_SIZE >> 9);
+
+	if ((rdev->ppl.sector < rdev->data_offset &&
+	     rdev->ppl.sector + ppl_size_new > rdev->data_offset) ||
+	    (rdev->ppl.sector >= rdev->data_offset &&
+	     rdev->data_offset + rdev->sectors > rdev->ppl.sector)) {
+		pr_warn("md/raid:%s: PPL space overlaps with data on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -EINVAL;
+	}
+
+	if (!rdev->mddev->external &&
+	    ((rdev->ppl.offset > 0 && rdev->ppl.offset < (rdev->sb_size >> 9)) ||
+	     (rdev->ppl.offset <= 0 && rdev->ppl.offset + ppl_size_new > 0))) {
+		pr_warn("md/raid:%s: PPL space overlaps with superblock on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -EINVAL;
+	}
+
+	rdev->ppl.size = ppl_size_new;
+
+	return 0;
+}
+
+int ppl_init_log(struct r5conf *conf)
+{
+	struct ppl_conf *ppl_conf;
+	struct mddev *mddev = conf->mddev;
+	int ret = 0;
+	int i;
+	bool need_cache_flush = false;
+
+	if (PAGE_SIZE != 4096)
+		return -EINVAL;
+
+	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
+		pr_warn("md/raid:%s: PPL is not compatible with bitmap\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_warn("md/raid:%s: PPL is not compatible with journal\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+
+	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
+	if (!ppl_conf)
+		return -ENOMEM;
+
+	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
+	if (!ppl_conf->io_kc) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
+	if (!ppl_conf->io_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
+	if (!ppl_conf->bs) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
+	if (!ppl_conf->meta_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->mddev = mddev;
+	ppl_conf->count = conf->raid_disks;
+	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct ppl_log),
+				       GFP_KERNEL);
+	if (!ppl_conf->child_logs) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	atomic64_set(&ppl_conf->seq, 0);
+
+	if (!mddev->external) {
+		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
+		ppl_conf->block_size = 512;
+	} else {
+		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
+	}
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct ppl_log *log = &ppl_conf->child_logs[i];
+		struct md_rdev *rdev = conf->disks[i].rdev;
+
+		mutex_init(&log->io_mutex);
+		spin_lock_init(&log->io_list_lock);
+		INIT_LIST_HEAD(&log->io_list);
+		INIT_LIST_HEAD(&log->no_mem_stripes);
+
+		log->ppl_conf = ppl_conf;
+		log->rdev = rdev;
+
+		if (rdev) {
+			struct request_queue *q;
+
+			ret = ppl_validate_rdev(rdev);
+			if (ret)
+				goto err;
+
+			q = bdev_get_queue(rdev->bdev);
+			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+				need_cache_flush = true;
+		}
+	}
+
+	if (need_cache_flush)
+		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
+			mdname(mddev));
+
+	conf->ppl = ppl_conf;
+
+	return 0;
+err:
+	ppl_exit_log(ppl_conf);
+	return ret;
+}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 02e02fe5b04e..21440b594878 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -739,7 +739,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 
-	if (conf->log)
+	if (conf->log || conf->ppl)
 		return false;
 	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
 		!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
@@ -936,6 +936,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		}
 	}
 
+	if (conf->ppl && ppl_write_stripe(conf->ppl, sh) == 0)
+		return;
+
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
 		int replace_only = 0;
@@ -3308,6 +3311,16 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
 			     &dd_idx, sh);
 }
 
+static void log_stripe_write_finished(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+
+	if (conf->log)
+		r5l_stripe_write_finished(sh);
+	else if (conf->ppl)
+		ppl_stripe_write_finished(sh);
+}
+
 static void
 handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 				struct stripe_head_state *s, int disks,
@@ -3347,7 +3360,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		if (bi)
 			bitmap_end = 1;
 
-		r5l_stripe_write_finished(sh);
+		log_stripe_write_finished(sh);
 
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 			wake_up(&conf->wait_for_overlap);
@@ -3766,7 +3779,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				discard_pending = 1;
 		}
 
-	r5l_stripe_write_finished(sh);
+	log_stripe_write_finished(sh);
 
 	if (!discard_pending &&
 	    test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
@@ -4756,7 +4769,7 @@ static void handle_stripe(struct stripe_head *sh)
 
 	if (s.just_cached)
 		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
-	r5l_stripe_write_finished(sh);
+	log_stripe_write_finished(sh);
 
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
@@ -6120,6 +6133,14 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	return handled;
 }
 
+static void log_write_stripe_run(struct r5conf *conf)
+{
+	if (conf->log)
+		r5l_write_stripe_run(conf->log);
+	else if (conf->ppl)
+		ppl_write_stripe_run(conf->ppl);
+}
+
 static int handle_active_stripes(struct r5conf *conf, int group,
 				 struct r5worker *worker,
 				 struct list_head *temp_inactive_list)
@@ -6157,7 +6178,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 
 	for (i = 0; i < batch_size; i++)
 		handle_stripe(batch[i]);
-	r5l_write_stripe_run(conf->log);
+	log_write_stripe_run(conf);
 
 	cond_resched();
 
@@ -6735,6 +6756,8 @@ static void free_conf(struct r5conf *conf)
 
 	if (conf->log)
 		r5l_exit_log(conf->log);
+	if (conf->ppl)
+		ppl_exit_log(conf->ppl);
 	if (conf->shrinker.nr_deferred)
 		unregister_shrinker(&conf->shrinker);
 
@@ -7196,6 +7219,13 @@ static int raid5_run(struct mddev *mddev)
 		BUG_ON(mddev->delta_disks != 0);
 	}
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
+	    test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
+			mdname(mddev));
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+	}
+
 	if (mddev->private == NULL)
 		conf = setup_conf(mddev);
 	else
@@ -7422,6 +7452,11 @@ static int raid5_run(struct mddev *mddev)
 			 mdname(mddev), bdevname(journal_dev->bdev, b));
 		if (r5l_init_log(conf, journal_dev))
 			goto abort;
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
+			 mdname(mddev));
+		if (ppl_init_log(conf))
+			goto abort;
 	}
 
 	return 0;
@@ -7690,7 +7725,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
 	sector_t newsize;
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || conf->ppl)
 		return -EINVAL;
 	sectors &= ~((sector_t)conf->chunk_sectors - 1);
 	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
@@ -7743,7 +7778,7 @@ static int check_reshape(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || conf->ppl)
 		return -EINVAL;
 	if (mddev->delta_disks == 0 &&
 	    mddev->new_layout == mddev->layout &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3cc4cb28f7e6..f915a7a0e752 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -230,6 +230,7 @@ struct stripe_head {
 	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
 
 	struct page		*ppl_page; /* partial parity of this stripe */
+	struct ppl_io_unit	*ppl_log_io;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -689,6 +690,7 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+	struct ppl_conf		*ppl;
 
 	struct bio_list		pending_bios;
 	spinlock_t		pending_bios_lock;
@@ -798,4 +800,10 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
 extern void r5c_update_on_rdev_error(struct mddev *mddev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+
+extern int ppl_init_log(struct r5conf *conf);
+extern void ppl_exit_log(struct ppl_conf *log);
+extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
+extern void ppl_write_stripe_run(struct ppl_conf *log);
+extern void ppl_stripe_write_finished(struct stripe_head *sh);
 #endif
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index fe2112810c43..2c28711cc5f1 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -398,4 +398,30 @@ struct r5l_meta_block {
 
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
+
+struct ppl_header_entry {
+	__le64 data_sector;	/* Raid sector of the new data */
+	__le32 pp_size;		/* Length of partial parity */
+	__le32 data_size;	/* Length of data */
+	__le32 parity_disk;	/* Member disk containing parity */
+	__le32 checksum;	/* Checksum of this entry */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8 reserved[PPL_HDR_RESERVED];/* Reserved space */
+	__le32 signature;		/* Signature (family number of volume) */
+	__le32 padding;
+	__le64 generation;		/* Generation number of PP Header */
+	__le32 entries_count;		/* Number of entries in entry array */
+	__le32 checksum;		/* Checksum of PP Header */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 4/7] md: add sysfs entries for PPL
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Add 'consistency_policy' attribute for array. It indicates how the array
maintains consistency in case of unexpected shutdown.

Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
and size of the PPL space on the device. They can't be changed for
active members if the array is started and PPL is enabled, so in the
setter functions only basic checks are performed. More checks are done
in ppl_validate_rdev() when starting the log.

These attributes are writable to allow enabling PPL for external
metadata arrays and (later) to enable/disable PPL for a running array.
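
As an illustration of the basic check done in the ppl_sector setter for
native (persistent) metadata, here is a minimal userspace sketch. It
assumes, as the setter in this patch does, that the PPL location must
lie within a signed 16-bit sector offset of the superblock start. The
function name ppl_offset_fits() and the plain stdint types are
hypothetical and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the range check in ppl_sector_store():
 * the distance between the PPL start and the superblock start must fit in
 * a signed 16-bit offset (-32768..32767 sectors).
 */
static bool ppl_offset_fits(uint64_t sector, uint64_t sb_start, int16_t *offset)
{
	if (sector >= sb_start) {
		uint64_t delta = sector - sb_start;

		if (delta > (uint64_t)INT16_MAX)
			return false;
		*offset = (int16_t)delta;
	} else {
		uint64_t delta = sb_start - sector;

		/* magnitude of INT16_MIN */
		if (delta > (uint64_t)32768)
			return false;
		*offset = (int16_t)(-(int32_t)delta);
	}
	return true;
}
```

A PPL placed more than 32767 sectors after, or more than 32768 sectors
before, the superblock would be rejected by such a check.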

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/admin-guide/md.rst |  32 ++++++++++-
 drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 1e61bf50595c..84de718f24a4 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -276,14 +276,14 @@ All md devices contain:
      array creation it will default to 0, though starting the array as
      ``clean`` will set it much larger.
 
-   new_dev
+  new_dev
      This file can be written but not read.  The value written should
      be a block device number as major:minor.  e.g. 8:0
      This will cause that device to be attached to the array, if it is
      available.  It will then appear at md/dev-XXX (depending on the
      name of the device) and further configuration is then possible.
 
-   safe_mode_delay
+  safe_mode_delay
      When an md array has seen no write requests for a certain period
      of time, it will be marked as ``clean``.  When another write
      request arrives, the array is marked as ``dirty`` before the write
@@ -292,7 +292,7 @@ All md devices contain:
      period as a number of seconds.  The default is 200msec (0.200).
      Writing a value of 0 disables safemode.
 
-   array_state
+  array_state
      This file contains a single word which describes the current
      state of the array.  In many cases, the state can be set by
      writing the word for the desired state, however some states
@@ -401,7 +401,30 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
 
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
 
 
 As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
 	adds bad blocks without acknowledging them. This is largely
 	for testing.
 
+      ppl_sector, ppl_size
+        Location and size (in sectors) of the space used for Partial Parity Log
+        on this device.
 
 
 An active md device will also contain an entry for each active device
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c2028007b209..3ff979d538d4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3157,6 +3157,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
 static struct rdev_sysfs_entry rdev_unack_bad_blocks =
 __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
 
+static ssize_t
+ppl_sector_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
+}
+
+static ssize_t
+ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned long long sector;
+
+	if (kstrtoull(buf, 10, &sector) < 0)
+		return -EINVAL;
+	if (sector != (sector_t)sector)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if ((sector > rdev->sb_start &&
+		     sector - rdev->sb_start > S16_MAX) ||
+		    (sector < rdev->sb_start &&
+		     rdev->sb_start - sector > -S16_MIN))
+			return -EINVAL;
+		rdev->ppl.offset = sector - rdev->sb_start;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.sector = sector;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_sector =
+__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
+
+static ssize_t
+ppl_size_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%u\n", rdev->ppl.size);
+}
+
+static ssize_t
+ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned int size;
+
+	if (kstrtouint(buf, 10, &size) < 0)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if (size > U16_MAX)
+			return -EINVAL;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.size = size;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_size =
+__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
+
 static struct attribute *rdev_default_attrs[] = {
 	&rdev_state.attr,
 	&rdev_errors.attr,
@@ -3167,6 +3239,8 @@ static struct attribute *rdev_default_attrs[] = {
 	&rdev_recovery_start.attr,
 	&rdev_bad_blocks.attr,
 	&rdev_unack_bad_blocks.attr,
+	&rdev_ppl_sector.attr,
+	&rdev_ppl_size.attr,
 	NULL,
 };
 static ssize_t
@@ -4903,6 +4977,46 @@ static struct md_sysfs_entry md_array_size =
 __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
        array_size_store);
 
+static ssize_t
+consistency_policy_show(struct mddev *mddev, char *page)
+{
+	int ret;
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		ret = sprintf(page, "journal\n");
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		ret = sprintf(page, "ppl\n");
+	} else if (mddev->bitmap) {
+		ret = sprintf(page, "bitmap\n");
+	} else if (mddev->pers) {
+		if (mddev->pers->sync_request)
+			ret = sprintf(page, "resync\n");
+		else
+			ret = sprintf(page, "none\n");
+	} else {
+		ret = sprintf(page, "unknown\n");
+	}
+
+	return ret;
+}
+
+static ssize_t
+consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	if (mddev->pers) {
+		return -EBUSY;
+	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
+		set_bit(MD_HAS_PPL, &mddev->flags);
+		return len;
+	} else {
+		return -EINVAL;
+	}
+}
+
+static struct md_sysfs_entry md_consistency_policy =
+__ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
+       consistency_policy_store);
+
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_layout.attr,
@@ -4918,6 +5032,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_reshape_direction.attr,
 	&md_array_size.attr,
 	&max_corr_read_errors.attr,
+	&md_consistency_policy.attr,
 	NULL,
 };
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 5/7] raid5-ppl: load and recover the log
From: Artur Paszkiewicz @ 2017-02-21 19:43 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Load the log from each disk when starting the array and recover if the
array is dirty.

The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we only read it from the
header, check that it matches on all disks, and use it later when
writing PPL.

In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.
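
To make the per-entry recovery step concrete, below is a minimal
userspace sketch of the parity reconstruction described in the comment
above ppl_recover_entry(): the new parity is the xor of the partial
parity and the data read back from all modified data disks, and for a
full stripe write (no partial parity) it is just the xor of the data.
The function name, the byte-wise loop and the in-memory buffers are
assumptions made for illustration; the patch itself uses async_xor()
on pages read with sync_page_io().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: rebuild parity for one block of a stripe.
 * 'modified' holds the data blocks that the logged write touched,
 * 'pp' is the logged partial parity or NULL for a full stripe write.
 */
static void ppl_sketch_recover_parity(uint8_t *parity,
				      const uint8_t *const *modified,
				      size_t n_modified,
				      const uint8_t *pp, size_t len)
{
	size_t d, i;

	/* start from the partial parity, or from zero if there is none */
	if (pp)
		memcpy(parity, pp, len);
	else
		memset(parity, 0, len);

	/* xor in every modified data block */
	for (d = 0; d < n_modified; d++)
		for (i = 0; i < len; i++)
			parity[i] ^= modified[d][i];
}
```

If the partial parity is the xor of the data blocks the write did not
touch, the result equals the xor of all data blocks in the stripe,
which is the parity that raid5 expects.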

For compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account,
along with the logical sector size of the underlying drives.
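
The block size selection can be sketched as follows. This is only an
illustration of the rule used in ppl_recover_entry(): recovery works in
512-byte blocks unless both ends of the entry are 4K aligned, in which
case it switches to 4K blocks. The SKETCH_* constants and the function
name are hypothetical; they assume a 4096-byte stripe unit, which
matches the PAGE_SIZE == 4096 requirement in ppl_init_log().

```c
#include <stdint.h>

/* assumed stripe unit: 4096 bytes, i.e. 8 sectors of 512 bytes */
#define SKETCH_STRIPE_SIZE	4096u
#define SKETCH_STRIPE_SECTORS	(SKETCH_STRIPE_SIZE >> 9)

/*
 * Hypothetical helper: pick the I/O block size for recovering one entry.
 * 'first' and 'last' are array sectors in 512-byte units.
 */
static unsigned int ppl_sketch_block_size(unsigned int block_size,
					  uint64_t first, uint64_t last)
{
	/* if start and end are 4k aligned, use a 4k block */
	if (block_size == 512 &&
	    (first & (SKETCH_STRIPE_SECTORS - 1)) == 0 &&
	    (last & (SKETCH_STRIPE_SECTORS - 1)) == 0)
		return SKETCH_STRIPE_SIZE;

	return block_size;
}
```

An entry covering sectors 0..8 would be recovered in a single 4K step,
while one starting at sector 1 would fall back to 512-byte steps.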

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c     |   5 +-
 2 files changed, 487 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index a00cabf1adf6..a7693353243a 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/crc32c.h>
+#include <linux/async_tx.h>
 #include <linux/raid/md_p.h>
 #include "md.h"
 #include "raid5.h"
@@ -84,6 +85,10 @@ struct ppl_conf {
 	mempool_t *io_pool;
 	struct bio_set *bs;
 	mempool_t *meta_pool;
+
+	/* used only for recovery */
+	int recovered_entries;
+	int mismatch_count;
 };
 
 struct ppl_log {
@@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
 		ppl_io_unit_finished(io);
 }
 
+static void ppl_xor(int size, struct page *page1, struct page *page2,
+		    struct page *page_result)
+{
+	struct async_submit_ctl submit;
+	struct dma_async_tx_descriptor *tx;
+	struct page *xor_srcs[] = { page1, page2 };
+
+	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
+			  NULL, NULL, NULL, NULL);
+	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+
+	async_tx_quiesce(&tx);
+}
+
+/*
+ * PPL recovery strategy: xor partial parity and data from all modified data
+ * disks within a stripe and write the result as the new stripe parity. If all
+ * stripe data disks are modified (full stripe write), no partial parity is
+ * available, so just xor the data disks.
+ *
+ * Recovery of a PPL entry shall occur only if all modified data disks are
+ * available and read from all of them succeeds.
+ *
+ * A PPL entry applies to a stripe, partial parity size for an entry is at most
+ * the size of the chunk. Examples of possible cases for a single entry:
+ *
+ * case 0: single data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size
+ *
+ * case 1: more than one data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size / modified_data_disks
+ *
+ * case 2: write to all data disks (also full stripe write):
+ *   data0    data1    data2                parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ |           | (no change)        |
+ * | -data- | -data- | -data- | --------> | xor all data       |
+ * | ------ | ------ | ------ | --------> | (no change)        |
+ * | ------ | ------ | ------ |           | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = 0
+ *
+ * The following cases are possible only in other implementations. The recovery
+ * code can handle them, but they are not generated at runtime because they can
+ * be reduced to cases 0, 1 and 2:
+ *
+ * case 3:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
+ * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
+ * | -data- | -data- | -data- | | -- | -> | xor all data       |
+ * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ *
+ * case 4:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ */
+static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
+			     sector_t ppl_sector)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct mddev *mddev = ppl_conf->mddev;
+	struct r5conf *conf = mddev->private;
+	int block_size = ppl_conf->block_size;
+	struct page *pages;
+	struct page *page1;
+	struct page *page2;
+	sector_t r_sector_first;
+	sector_t r_sector_last;
+	int strip_sectors;
+	int data_disks;
+	int i;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+	unsigned int pp_size = le32_to_cpu(e->pp_size);
+	unsigned int data_size = le32_to_cpu(e->data_size);
+
+	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
+
+	if ((pp_size >> 9) < conf->chunk_sectors) {
+		if (pp_size > 0) {
+			data_disks = data_size / pp_size;
+			strip_sectors = pp_size >> 9;
+		} else {
+			data_disks = conf->raid_disks - conf->max_degraded;
+			strip_sectors = (data_size >> 9) / data_disks;
+		}
+		r_sector_last = r_sector_first +
+				(data_disks - 1) * conf->chunk_sectors +
+				strip_sectors;
+	} else {
+		data_disks = conf->raid_disks - conf->max_degraded;
+		strip_sectors = conf->chunk_sectors;
+		r_sector_last = r_sector_first + (data_size >> 9);
+	}
+
+	pages = alloc_pages(GFP_KERNEL, 1);
+	if (!pages)
+		return -ENOMEM;
+	page1 = pages;
+	page2 = pages + 1;
+
+	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
+		 (unsigned long long)r_sector_first,
+		 (unsigned long long)r_sector_last);
+
+	/* if start and end are 4k aligned, use a 4k block */
+	if (block_size == 512 &&
+	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
+	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
+		block_size = STRIPE_SIZE;
+
+	/* iterate through blocks in strip */
+	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
+		bool update_parity = false;
+		sector_t parity_sector;
+		struct md_rdev *parity_rdev;
+		struct stripe_head sh;
+		int disk;
+		int indent = 0;
+
+		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
+		indent += 2;
+
+		memset(page_address(page1), 0, PAGE_SIZE);
+
+		/* iterate through data member disks */
+		for (disk = 0; disk < data_disks; disk++) {
+			int dd_idx;
+			struct md_rdev *rdev;
+			sector_t sector;
+			sector_t r_sector = r_sector_first + i +
+					    (disk * conf->chunk_sectors);
+
+			pr_debug("%s:%*s data member disk %d start\n",
+				 __func__, indent, "", disk);
+			indent += 2;
+
+			if (r_sector >= r_sector_last) {
+				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
+					 __func__, indent, "",
+					 (unsigned long long)r_sector);
+				indent -= 2;
+				continue;
+			}
+
+			update_parity = true;
+
+			/* map raid sector to member disk */
+			sector = raid5_compute_sector(conf, r_sector, 0,
+						      &dd_idx, NULL);
+			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)r_sector, dd_idx,
+				 (unsigned long long)sector);
+
+			rdev = conf->disks[dd_idx].rdev;
+			if (!rdev) {
+				pr_debug("%s:%*s data member disk %d missing\n",
+					 __func__, indent, "", dd_idx);
+				update_parity = false;
+				break;
+			}
+
+			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
+				 __func__, indent, "", bdevname(rdev->bdev, b),
+				 (unsigned long long)sector);
+			if (!sync_page_io(rdev, sector, block_size, page2,
+					REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+
+			indent -= 2;
+		}
+
+		if (!update_parity)
+			continue;
+
+		if (pp_size > 0) {
+			pr_debug("%s:%*s reading pp disk sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)(ppl_sector + i));
+			if (!sync_page_io(log->rdev,
+					ppl_sector - log->rdev->data_offset + i,
+					block_size, page2, REQ_OP_READ, 0,
+					false)) {
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				md_error(mddev, log->rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+		}
+
+		/* map raid sector to parity disk */
+		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
+				0, &disk, &sh);
+		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
+		parity_rdev = conf->disks[sh.pd_idx].rdev;
+
+		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
+		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
+			 __func__, indent, "",
+			 (unsigned long long)parity_sector,
+			 bdevname(parity_rdev->bdev, b));
+		if (!sync_page_io(parity_rdev, parity_sector, block_size,
+				page1, REQ_OP_WRITE, 0, false)) {
+			pr_debug("%s:%*s parity write error!\n", __func__,
+				 indent, "");
+			md_error(mddev, parity_rdev);
+			ret = -EIO;
+			goto out;
+		}
+	}
+out:
+	__free_pages(pages, 1);
+	return ret;
+}
+
+static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
+	struct page *page;
+	int i;
+	int ret = 0;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* iterate through all PPL entries saved */
+	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+		u32 pp_size = le32_to_cpu(e->pp_size);
+		sector_t sector = ppl_sector;
+		int ppl_entry_sectors = pp_size >> 9;
+		u32 crc, crc_stored;
+
+		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
+			 __func__, rdev->raid_disk, i,
+			 (unsigned long long)ppl_sector, pp_size);
+
+		crc = ~0;
+		crc_stored = le32_to_cpu(e->checksum);
+
+		/* read partial parity for this entry and calculate its checksum */
+		while (pp_size) {
+			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
+
+			if (!sync_page_io(rdev, sector - rdev->data_offset,
+					s, page, REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			crc = crc32c_le(crc, page_address(page), s);
+
+			pp_size -= s;
+			sector += s >> 9;
+		}
+
+		crc = ~crc;
+
+		if (crc != crc_stored) {
+			/*
+			 * Don't recover this entry if the checksum does not
+			 * match, but keep going and try to recover other
+			 * entries.
+			 */
+			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
+				 __func__, crc_stored, crc);
+			ppl_conf->mismatch_count++;
+		} else {
+			ret = ppl_recover_entry(log, e, ppl_sector);
+			if (ret)
+				goto out;
+			ppl_conf->recovered_entries++;
+		}
+
+		ppl_sector += ppl_entry_sectors;
+	}
+out:
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_write_empty_header(struct ppl_log *log)
+{
+	struct page *page;
+	struct ppl_header *pplhdr;
+	struct md_rdev *rdev = log->rdev;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
+		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
+
+	page = alloc_page(GFP_NOIO | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+
+	pplhdr = page_address(page);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
+		md_error(rdev->mddev, rdev);
+		ret = -EIO;
+	}
+
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_load_distributed(struct ppl_log *log)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	struct page *page;
+	struct ppl_header *pplhdr;
+	u32 crc, crc_stored;
+	u32 signature;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
+
+	/* read PPL header */
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
+		md_error(mddev, rdev);
+		ret = -EIO;
+		goto out;
+	}
+	pplhdr = page_address(page);
+
+	/* check header validity */
+	crc_stored = le32_to_cpu(pplhdr->checksum);
+	pplhdr->checksum = 0;
+	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
+
+	if (crc_stored != crc) {
+		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
+			 __func__, crc_stored, crc);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	signature = le32_to_cpu(pplhdr->signature);
+
+	if (mddev->external) {
+		/*
+		 * For external metadata the header signature is set and
+		 * validated in userspace.
+		 */
+		ppl_conf->signature = signature;
+	} else if (ppl_conf->signature != signature) {
+		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
+			 __func__, signature, ppl_conf->signature);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	/* attempt to recover from log if we are starting a dirty array */
+	if (!mddev->pers && mddev->recovery_cp != MaxSector)
+		ret = ppl_recover(log, pplhdr);
+out:
+	/* write empty header if we are starting the array */
+	if (!ret && !mddev->pers)
+		ret = ppl_write_empty_header(log);
+
+	__free_page(page);
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
+static int ppl_load(struct ppl_conf *ppl_conf)
+{
+	int ret = 0;
+	u32 signature = 0;
+	bool signature_set = false;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct ppl_log *log = &ppl_conf->child_logs[i];
+
+		/* skip missing drive */
+		if (!log->rdev)
+			continue;
+
+		ret = ppl_load_distributed(log);
+		if (ret)
+			break;
+
+		/*
+		 * For external metadata we can't check if the signature is
+		 * correct on a single drive, but we can check if it is the same
+		 * on all drives.
+		 */
+		if (ppl_conf->mddev->external) {
+			if (!signature_set) {
+				signature = ppl_conf->signature;
+				signature_set = true;
+			} else if (signature != ppl_conf->signature) {
+				pr_warn("md/raid:%s: PPL header signature does not match on all member drives\n",
+					mdname(ppl_conf->mddev));
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
 void ppl_exit_log(struct ppl_conf *ppl_conf)
 {
 	kfree(ppl_conf->child_logs);
@@ -608,6 +1074,23 @@ int ppl_init_log(struct r5conf *conf)
 		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
 			mdname(mddev));
 
+	/* load and possibly recover the logs from the member disks */
+	ret = ppl_load(ppl_conf);
+
+	if (ret) {
+		goto err;
+	} else if (!mddev->pers &&
+		   mddev->recovery_cp == 0 && !mddev->degraded &&
+		   ppl_conf->recovered_entries > 0 &&
+		   ppl_conf->mismatch_count == 0) {
+		/*
+		 * If we are starting a dirty array and the recovery succeeds
+		 * without any issues, set the array as clean.
+		 */
+		mddev->recovery_cp = MaxSector;
+		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	}
+
 	conf->ppl = ppl_conf;
 
 	return 0;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 21440b594878..8b52392457d8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7317,7 +7317,10 @@ static int raid5_run(struct mddev *mddev)
 
 	if (mddev->degraded > dirty_parity_disks &&
 	    mddev->recovery_cp != MaxSector) {
-		if (mddev->ok_start_degraded)
+		if (test_bit(MD_HAS_PPL, &mddev->flags))
+			pr_crit("md/raid:%s: starting dirty degraded array with PPL.\n",
+				mdname(mddev));
+		else if (mddev->ok_start_degraded)
 			pr_crit("md/raid:%s: starting dirty degraded array - data corruption possible.\n",
 				mdname(mddev));
 		else {
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 6/7] raid5-ppl: support disk hot add/remove with PPL
From: Artur Paszkiewicz @ 2017-02-21 19:44 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Add a function to modify the log by removing an rdev when a drive fails
or adding when a spare/replacement is activated as a raid member.

Removing a disk just clears the child log rdev pointer. No new stripes
will be accepted for this child log in ppl_write_stripe() and running io
units will be processed without writing PPL to the device.

Adding a disk sets the child log rdev pointer and writes an empty PPL
header.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c     | 15 +++++++++++++++
 drivers/md/raid5.h     |  8 ++++++++
 3 files changed, 70 insertions(+)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index a7693353243a..c2070d124849 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -319,6 +319,12 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 
 	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
 	bio->bi_private = io;
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		ppl_log_endio(bio);
+		return;
+	}
+
 	bio->bi_end_io = ppl_log_endio;
 	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
 	bio->bi_bdev = log->rdev->bdev;
@@ -1098,3 +1104,44 @@ int ppl_init_log(struct r5conf *conf)
 	ppl_exit_log(ppl_conf);
 	return ret;
 }
+
+int ppl_modify_log(struct ppl_conf *ppl_conf, struct md_rdev *rdev,
+		   enum ppl_modify_log_operation operation)
+{
+	struct ppl_log *log;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+
+	if (!rdev)
+		return -EINVAL;
+
+	pr_debug("%s: disk: %d operation: %s dev: %s\n",
+		 __func__, rdev->raid_disk,
+		 operation == PPL_MODIFY_LOG_DISK_REMOVE ? "remove" :
+		 (operation == PPL_MODIFY_LOG_DISK_ADD ? "add" : "?"),
+		 bdevname(rdev->bdev, b));
+
+	if (rdev->raid_disk < 0)
+		return 0;
+
+	if (rdev->raid_disk >= ppl_conf->count)
+		return -ENODEV;
+
+	log = &ppl_conf->child_logs[rdev->raid_disk];
+
+	mutex_lock(&log->io_mutex);
+	if (operation == PPL_MODIFY_LOG_DISK_REMOVE) {
+		log->rdev = NULL;
+	} else if (operation == PPL_MODIFY_LOG_DISK_ADD) {
+		ret = ppl_validate_rdev(rdev);
+		if (!ret) {
+			log->rdev = rdev;
+			ret = ppl_write_empty_header(log);
+		}
+	} else {
+		ret = -EINVAL;
+	}
+	mutex_unlock(&log->io_mutex);
+
+	return ret;
+}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8b52392457d8..b17e90f06f19 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7623,6 +7623,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			*rdevp = rdev;
 		}
 	}
+	if (conf->ppl) {
+		err = ppl_modify_log(conf->ppl, rdev,
+				     PPL_MODIFY_LOG_DISK_REMOVE);
+		if (err)
+			goto abort;
+	}
 	if (p->replacement) {
 		/* We must have just cleared 'rdev' */
 		p->rdev = p->replacement;
@@ -7632,6 +7638,10 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			   */
 		p->replacement = NULL;
 		clear_bit(WantReplacement, &rdev->flags);
+
+		if (conf->ppl)
+			err = ppl_modify_log(conf->ppl, p->rdev,
+					     PPL_MODIFY_LOG_DISK_ADD);
 	} else
 		/* We might have just removed the Replacement as faulty-
 		 * clear the bit just in case
@@ -7695,6 +7705,11 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 			if (rdev->saved_raid_disk != disk)
 				conf->fullsync = 1;
 			rcu_assign_pointer(p->rdev, rdev);
+
+			if (conf->ppl)
+				err = ppl_modify_log(conf->ppl, rdev,
+						     PPL_MODIFY_LOG_DISK_ADD);
+
 			goto out;
 		}
 	}
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index f915a7a0e752..43cfa5fa71b3 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -806,4 +806,12 @@ extern void ppl_exit_log(struct ppl_conf *log);
 extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
 extern void ppl_write_stripe_run(struct ppl_conf *log);
 extern void ppl_stripe_write_finished(struct stripe_head *sh);
+
+enum ppl_modify_log_operation {
+	PPL_MODIFY_LOG_DISK_REMOVE,
+	PPL_MODIFY_LOG_DISK_ADD,
+};
+extern int ppl_modify_log(struct ppl_conf *log, struct md_rdev *rdev,
+			  enum ppl_modify_log_operation operation);
+
 #endif
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 7/7] raid5-ppl: runtime PPL enabling or disabling
From: Artur Paszkiewicz @ 2017-02-21 19:44 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170221194401.18733-1-artur.paszkiewicz@intel.com>

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array.  The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c        | 12 ++++++++---
 drivers/md/md.h        |  2 ++
 drivers/md/raid5-ppl.c |  4 ++++
 drivers/md/raid5.c     | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3ff979d538d4..b70e19513588 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5003,14 +5003,20 @@ consistency_policy_show(struct mddev *mddev, char *page)
 static ssize_t
 consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
 {
+	int err = 0;
+
 	if (mddev->pers) {
-		return -EBUSY;
+		if (mddev->pers->change_consistency_policy)
+			err = mddev->pers->change_consistency_policy(mddev, buf);
+		else
+			err = -EBUSY;
 	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
 		set_bit(MD_HAS_PPL, &mddev->flags);
-		return len;
 	} else {
-		return -EINVAL;
+		err = -EINVAL;
 	}
+
+	return err ? err : len;
 }
 
 static struct md_sysfs_entry md_consistency_policy =
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 66a5a16e79f7..88a5155c152e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -548,6 +548,8 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	/* Changes the consistency policy of an active array. */
+	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
 };
 
 struct md_sysfs_entry {
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index c2070d124849..a5c4d3333bce 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1095,6 +1095,10 @@ int ppl_init_log(struct r5conf *conf)
 		 */
 		mddev->recovery_cp = MaxSector;
 		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	} else if (mddev->pers && ppl_conf->mismatch_count > 0) {
+		/* no mismatch allowed when enabling PPL for a running array */
+		ret = -EINVAL;
+		goto err;
 	}
 
 	conf->ppl = ppl_conf;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b17e90f06f19..754fb8e1c76f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8319,6 +8319,63 @@ static void *raid6_takeover(struct mddev *mddev)
 	return setup_conf(mddev);
 }
 
+static void raid5_reset_stripe_cache(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+
+	mutex_lock(&conf->cache_size_mutex);
+	while (conf->max_nr_stripes &&
+	       drop_one_stripe(conf))
+		;
+	while (conf->min_nr_stripes > conf->max_nr_stripes &&
+	       grow_one_stripe(conf, GFP_KERNEL))
+		;
+	mutex_unlock(&conf->cache_size_mutex);
+}
+
+static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
+{
+	struct r5conf *conf;
+	int err;
+
+	err = mddev_lock(mddev);
+	if (err)
+		return err;
+	conf = mddev->private;
+	if (!conf) {
+		mddev_unlock(mddev);
+		return -ENODEV;
+	}
+
+	if (strncmp(buf, "ppl", 3) == 0 &&
+	    !test_bit(MD_HAS_PPL, &mddev->flags)) {
+		mddev_suspend(mddev);
+		err = ppl_init_log(conf);
+		if (!err) {
+			set_bit(MD_HAS_PPL, &mddev->flags);
+			raid5_reset_stripe_cache(mddev);
+		}
+		mddev_resume(mddev);
+	} else if (strncmp(buf, "resync", 6) == 0 &&
+		   test_bit(MD_HAS_PPL, &mddev->flags)) {
+		mddev_suspend(mddev);
+		ppl_exit_log(conf->ppl);
+		conf->ppl = NULL;
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+		raid5_reset_stripe_cache(mddev);
+		mddev_resume(mddev);
+	} else {
+		err = -EINVAL;
+	}
+
+	if (!err)
+		md_update_sb(mddev, 1);
+
+	mddev_unlock(mddev);
+
+	return err;
+}
+
 static struct md_personality raid6_personality =
 {
 	.name		= "raid6",
@@ -8364,6 +8421,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.change_consistency_policy = raid5_change_consistency_policy,
 };
 
 static struct md_personality raid4_personality =
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-21 20:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20170221174542.rer73ywil3oq26gj@kernel.org>

On 2017/2/22 1:45 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> >> On 2017/2/21 8:29 AM, NeilBrown wrote:
>>> On Mon, Feb 20 2017, Coly Li wrote:
>>>
> >>>>> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
>>>>>
>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>>>
>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>>>
> >>>>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote: I know you are
> >>>>>>>>> going to change this as Shaohua wants the splitting to
>>>>>>>>> happen in a separate function, which I agree with, but
>>>>>>>>> there is something else wrong here. Calling
>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
>>>>>>>>> It is OK for simple devices, but when one request can
>>>>>>>>> wait for another request to the same device it can 
>>>>>>>>> deadlock. This can happen with raid1.  If a resync
>>>>>>>>> request calls raise_barrier() between one request and
>>>>>>>>> the next, then the next has to wait for the resync
>>>>>>>>> request, which has to wait for the first request. As
>>>>>>>>> the first request will be stuck in the queue in 
>>>>>>>>> generic_make_request(), you get a deadlock.
>>>>>>>>
>>>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>>>> understand it as bio_list_on_stack in this function? And
>>>>>>>> queue in underlying device, can I understand it as the
>>>>>>>> data structures like plug->pending and 
>>>>>>>> conf->pending_bio_list ?
>>>>>>>
>>>>>>> Yes, the queue in generic_make_request() is the
>>>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>>>> about.  I'm not referring to plug->pending or
>>>>>>> conf->pending_bio_list at all.
>>>>>>>
>>>>>>>>
>>>>>>>> I still don't get the point of deadlock, let me try to
>>>>>>>> explain why I don't see the possible deadlock. If a bio
>>>>>>>> is split, and the first part is processed by
>>>>>>>> make_request_fn(), and then a resync comes and it will 
>>>>>>>> raise a barrier, there are 3 possible conditions, - the
>>>>>>>> resync I/O tries to raise barrier on same bucket of the
>>>>>>>> first regular bio. Then the resync task has to wait to
>>>>>>>> the first bio drops its conf->nr_pending[idx]
>>>>>>>
>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>>>> happens immediately. Then the resync_task will increment
>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
>>>>>>> will have submitted bios to the underlying device, and they
>>>>>>> will be in the bio_list_on_stack queue, and will not be
>>>>>>> processed until raid1_make_request() completes.
>>>>>>>
>>>>>>> The loop in raid1_make_request() will then call
>>>>>>> make_request_fn() which will call wait_barrier(), which
>>>>>>> will wait for ->barrier[idx] to be zero.
>>>>>>
>>>>>> Thinking more carefully about this.. the 'idx' that the
>>>>>> second bio will wait for will normally be different, so there
>>>>>> won't be a deadlock after all.
>>>>>>
>>>>>> However it is possible for hash_long() to produce the same
>>>>>> idx for two consecutive barrier_units so there is still the
>>>>>> possibility of a deadlock, though it isn't as likely as I
>>>>>> thought at first.
>>>>>
>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>>>> original patch. Also fix a 'use-after-free' bug. For the
>>>>> deadlock issue, I'll add below patch, please check.
>>>>>
>>>>> Thanks, Shaohua
>>>>>
>>>>
>>
>> Neil,
>>
> >> Thanks for your patient explanation, I think I am starting to follow
> >> what you mean. Let me try to restate what I understand; correct me if
> >> I am wrong.
>>
>>
>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
>>>> happen, but once we have barrier bucket on it , there will no
>>>> deadlock.
>>>>
> >>>> My question is, this deadlock only happens when a big bio is
> >>>> split, and the split small bios are contiguous, and the resync io
> >>>> visits barrier buckets in sequential order too. If adjacent
> >>>> split regular bios or resync bios hit the same barrier bucket,
> >>>> it would be a very big failure of the hash design, and should
> >>>> have been found already. But no one has complained about it, so
> >>>> I am not convinced the deadlock is real with io barrier buckets
> >>>> (this is what Neil is concerned about).
>>>
>>> I think you are wrong about the design goal of a hash function. 
> >>> When fed a sequence of inputs, with any stride (i.e. with any
>>> constant difference between consecutive inputs), the output of the
>>> hash function should appear to be random. A random sequence can
>>> produce the same number twice in a row. If the hash function
>>> produces a number from 0 to N-1, you would expect two consecutive
>>> outputs to be the same about once every N inputs.
>>>
>>
>> Yes, you are right. But when I mentioned hash conflict, I limit the
>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
>> address divided by 64MB I/O barrier unit size, its value range is
>> reduced to [0, 1<<38].
>>
>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
>>
>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
>> bios at most.
>>
> >> Then in this patch, the hash question reduces to: for any 65
> >> consecutive integers in the range [0, 1<<38], when hash_long() hashes
> >> them into the range [0, 1023], will any hash conflict happen among
> >> these integers?
> >>
> >> I tried a half range [0, 1<<37] to check for hash conflicts, by writing
> >> a simple program to emulate the hash calculation in the new I/O barrier
> >> patch, iterating over all runs of {2, 65, 128, 512} consecutive
> >> integers in the range [0, 1<<37].
> >>
> >> On a 20 core CPU each run took 7+ hours, and in the end I found no hash
> >> conflict for up to 512 consecutive integers under the above limited
> >> condition. For 1024, a lot of hash conflicts are detected.
> >>
> >> The [0, 1<<37] range maps back to the [0, 63] LBA range, which is large
> >> enough for almost all existing md raid configurations. So for the
> >> current kernel implementation and real world devices, for a single bio,
> >> there is no possible hash conflict with the new I/O barrier patch.
>>
> >> If bi_iter.bi_size changes from unsigned int to unsigned long in the
> >> future, the above assumption will be wrong. There will be hash
> >> conflicts and a potential deadlock, which is quite subtle. Yes, I
> >> agree with you. No, bio split inside a loop is not perfect.
>>
>>> Even if there was no possibility of a deadlock from a resync
>>> request happening between two bios, there are other possibilities.
>>>
>>
> >> The text below teaches me more about the raid1 code, but confuses me
> >> more as well. Here are my questions,
>>
>>> It is not, in general, safe to call mempool_alloc() twice in a
>>> row, without first ensuring that the first allocation will get
>>> freed by some other thread.  raid1_write_request() allocates from
>>> r1bio_pool, and then submits bios to the underlying device, which
>>> get queued on bio_list_on_stack.  They will not be processed until
>>> after raid1_make_request() completes, so when raid1_make_request
>>> loops around and calls raid1_write_request() again, it will try to
>>> allocate another r1bio from r1bio_pool, and this might end up
>>> waiting for the r1bio which is trapped and cannot complete.
>>>
>>
>> Can I say that it is because blk_finish_plug() won't be called before
>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
>> will be added into plug->pending, but before blk_finish_plug() is
>> called, they won't be handled.
> 
> blk_finish_plug is called if raid1_make_request sleeps. The bio is held in
> current->bio_list, not in the plug list.
>  

Oops, I mixed them up; thank you for clarifying :-)

>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>> impossible.  If 256 threads all attempt a write (or read) that
>>> crosses a boundary, then they will consume all 256 preallocated
>>> entries, and want more. If there is no free memory, they will block
>>> indefinitely.
>>>
>>
>> If raid1_make_request() is modified into this way,
>> +	if (bio_data_dir(split) == READ)
>> +		raid1_read_request(mddev, split);
>> +	else
>> +		raid1_write_request(mddev, split);
>> +	if (split != bio)
>> +		generic_make_request(bio);
>>
>> Then the original bio will be added into the bio_list_on_stack of top
>> level generic_make_request(), current->bio_list is initialized, when
>> generic_make_request() is called nested in raid1_make_request(), the
>> split bio will be added into current->bio_list and nothing else happens.
>>
>> After the nested generic_make_request() returns, the code back to next
>> code of generic_make_request(),
>> 2022                         ret = q->make_request_fn(q, bio);
>> 2023
>> 2024                         blk_queue_exit(q);
>> 2025
>> 2026                         bio = bio_list_pop(current->bio_list);
>>
>> bio_list_pop() will return the second half of the split bio, and it is
> 
> So in the above sequence, current->bio_list will hold bios in this order:
> bios to the underlying disks, then the second half of the original bio
> 
> bio_list_pop will pop the bios to the underlying disks first, handle them,
> then the second half of the original bio.
> 
> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> array, handling the middle layer bio will make the 3rd layer bio hold to
> bio_list again.
> 

Could you please give me more hints?
- What is the meaning of "hold" in "make the 3rd layer bio hold to
bio_list again"?
- Why does a deadlock happen if the 3rd layer bio is held on bio_list again?

Thanks in advance.

Coly
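
The hash-conflict check described in the quoted text above can be
sketched in user space roughly as follows. This is only a sketch under
stated assumptions, not the program that was actually run: it assumes
hash_long() on 64-bit reduces to a multiply by the 64-bit golden-ratio
constant keeping the top bits, and that there are 1024 barrier buckets.
The names hash64(), window_has_conflict() and
count_conflicting_windows() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the multiplier used by the kernel's 64-bit hash. */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull
/* Assumption: 1024 barrier buckets, i.e. a 10-bit bucket index. */
#define BUCKET_BITS  10
#define BUCKET_COUNT (1u << BUCKET_BITS)

/* Multiplicative hash: keep the top `bits` bits of the product. */
static unsigned int hash64(uint64_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

/*
 * Return 1 if any two of the `window` consecutive barrier units
 * starting at `start` land in the same bucket, 0 otherwise.
 */
static int window_has_conflict(uint64_t start, unsigned int window)
{
	unsigned char seen[BUCKET_COUNT];
	unsigned int i;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < window; i++) {
		unsigned int idx = hash64(start + i, BUCKET_BITS);

		if (seen[idx])
			return 1;
		seen[idx] = 1;
	}
	return 0;
}

/* Count how many of `nr_starts` window positions from `first` conflict. */
static uint64_t count_conflicting_windows(uint64_t first, uint64_t nr_starts,
					  unsigned int window)
{
	uint64_t n, hits = 0;

	for (n = 0; n < nr_starts; n++)
		hits += window_has_conflict(first + n, window);
	return hits;
}
```

Calling count_conflicting_windows(0, 1ULL << 37, 65) would repeat the
65-integer case; a window larger than 1024 must always conflict, since
there are only 1024 buckets.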

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-21 20:16 UTC (permalink / raw)
  To: Wols Lists
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <58AC92F9.5050108@youngman.org.uk>

On 2017/2/22 3:20 AM, Wols Lists wrote:
> On 21/02/17 11:30, Coly Li wrote:
>> On 2017/2/21 2:14 AM, Wols Lists wrote:
>>> On 20/02/17 08:07, Coly Li wrote:
>>>> For the function pointer assignment, it is because I see a branch happening in a loop. If I use a function pointer, I can avoid a redundant branch inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..
>>>
>>> Can you force gcc to inline or compile a function? Isn't it dangerous to
>>> rely on default behaviour and assume it won't change when the compiler
>>> is upgraded?
>>
>> I choose to trust compiler, and trust the people behind gcc.
>>
> I admire your faith. I seem to remember several occasions where the gcc
> people added new optimisations and caused all sorts of subtle havoc with
> the kernel where it relied on the old behaviour. Don't forget - the
> linux kernel is one of the compiler's most demanding customers. And
> don't forget also - there are quite a few people now using llvm to
> compile the kernel (it may not yet be working - I think it is certainly
> for simple use cases) so tests on gcc don't guarantee it'll work for
> everyone ...

I know the risk, but I don't think I can figure out where gcc goes wrong
by myself. So I have to choose to trust the compiler developers.

> 
> I think you can trace the addition of many kernel compile-time flags to
> that sort of thing - disabling new optimisations.

Are you suggesting that if I look at the kernel's compile command lines,
I will find many compile-time flags that disable some of the new gcc
optimizations?

If I understand you correctly, please permit me to say this is a good
point. I will look out for these kinds of flags, and check what they
mean :-)

Thanks.

Coly
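
On the earlier question of forcing the inlining decision instead of
relying on compiler defaults: GCC and Clang both accept function
attributes that override the inlining heuristics, and the kernel wraps
them as __always_inline and noinline. A minimal user-space sketch;
handle_read(), handle_write() and dispatch() are hypothetical stand-ins
for illustration, not the raid1 functions:

```c
#include <assert.h>

/* Always expanded at the call site, regardless of size heuristics. */
static inline __attribute__((always_inline)) int handle_read(int sectors)
{
	return sectors;		/* pretend: number of sectors read */
}

/* Never inlined: every call is a real function call. */
static __attribute__((noinline)) int handle_write(int sectors)
{
	return -sectors;	/* pretend: negative marks a write */
}

/* One branch per request, whatever the compiler version decides. */
static int dispatch(int is_write, int sectors)
{
	assert(sectors >= 0);
	return is_write ? handle_write(sectors) : handle_read(sectors);
}
```

With the attributes in place the choice no longer depends on the
optimization level or on heuristics that may change between releases;
checking the disassembly is still the way to confirm the result.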

^ permalink raw reply

* Re: Trouble reassembling RAID10
From: Roger Roglans @ 2017-02-21 20:24 UTC (permalink / raw)
  To: Wols Lists; +Cc: Phil Turmel, linux-raid
In-Reply-To: <58AC95F3.3010601@youngman.org.uk>

Hey Wol,
Yes I followed it and it was immensely helpful. I guess I didn't want
to make my post too long so that is why I only included the shortened
command outputs; in the future I'll just include everything. I believe
that the big issue was that the --force command did not work when it
seemed like it should have. I will keep on the lookout for that bug
again in the future.

Just a note on the wiki: "When Things Go Wrogn" should be "When Things Go Wrong"

Best,
Roger

On Tue, Feb 21, 2017 at 1:33 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 21/02/17 18:38, Roger Roglans wrote:
>> Hi Phil,
>>
>> seems very useful to know in the future. I ended up just assuming
>> clean and using "--create". Since I was able to discern the exact
>> configurations, I was able to mount it and am currently transferring
>> data. I know it was not the ideal solution but I believe that it
>> worked out with only minimal corruption. I might have problems with
>> another array soon. If so, I will certainly contact this mailing list
>> again.
>
> If I can plug my own work :-) there is now a section on the linux wiki
> about troubleshooting an array, and what data to gather for the list.
> Look at
>
> https://raid.wiki.kernel.org/index.php/Linux_Raid#When_Things_Go_Wrogn
>
> That should be enough to fix simple problems or, for more serious ones,
> you'll have the bulk if not all the information the members of the list
> will need, saving the back-and-forth of "can we have this, can we have
> that".
>
> If you find any problems with the information on the wiki, let me know
> and I'll endeavour to fix it.
>
> Cheers,
> Wol

^ permalink raw reply

* Re: Trouble reassembling RAID10
From: Wols Lists @ 2017-02-21 21:14 UTC (permalink / raw)
  To: Roger Roglans; +Cc: linux-raid
In-Reply-To: <CAPXQET=rms55PMjGkDmJBuQ_qgW5pPzoANgz1XWFkqBeGj9UMg@mail.gmail.com>

On 21/02/17 20:24, Roger Roglans wrote:
> Just a note on the wiki: "When Things Go Wrogn" should be "When Things Go Wrong"

That's a long and hallowed deliberate smelling pistake :-) Goes back at
least 30 years ...

Cheers,
Wol

^ permalink raw reply

* [PATCH] md/linear: shutup lockdep warnning
From: Shaohua Li @ 2017-02-21 21:55 UTC (permalink / raw)
  To: linux-raid; +Cc: Coly Li

Commit 03a9e24(md linear: fix a race between linear_add() and
linear_congested()) introduces the warning.

Cc: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/linear.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 789008b..5b06b0d 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -224,7 +224,8 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
 	 * oldconf until no one uses it anymore.
 	 */
 	mddev_suspend(mddev);
-	oldconf = rcu_dereference(mddev->private);
+	oldconf = rcu_dereference_protected(mddev->private,
+			lockdep_is_held(&mddev->reconfig_mutex));
 	mddev->raid_disks++;
 	WARN_ONCE(mddev->raid_disks != newconf->raid_disks,
 		"copied raid_disks doesn't match mddev->raid_disks");
-- 
2.9.3


^ permalink raw reply related

* [PATCH] md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
From: Shaohua Li @ 2017-02-21 23:32 UTC (permalink / raw)
  To: linux-raid; +Cc: Ming Lei

There are two issues, introduced by commit 8e58e32(md/raid1: use
bio_clone_bioset_partial() in case of write behind):
- bio_clone_bioset_partial() uses bytes instead of sectors as parameters
- in writebehind mode, we return bio if all !writemostly disk bios finish,
  which could happen before writemostly disk bios run. So all
  writemostly disk bios should have their bvec. Here we just make sure
  all bios are cloned instead of fast cloned.

Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e1ee446..33526b4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1472,8 +1472,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 			    !waitqueue_active(&bitmap->behind_wait)) {
 				mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
 								mddev->bio_set,
-								offset,
-								max_sectors);
+								offset << 9,
+								max_sectors << 9);
 				alloc_behind_pages(mbio, r1_bio);
 			}
 
@@ -1485,8 +1485,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 		}
 
 		if (!mbio) {
-			mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-			bio_trim(mbio, offset, max_sectors);
+			if (r1_bio->behind_bvecs)
+				mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
+								mddev->bio_set,
+								offset << 9,
+								max_sectors << 9);
+			else {
+				mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
+				bio_trim(mbio, offset, max_sectors);
+			}
 		}
 
 		if (r1_bio->behind_bvecs) {
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH] md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
From: Ming Lei @ 2017-02-22  0:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <81018680b558c83936577af437ee4b29d0d5cb3e.1487719900.git.shli@fb.com>

On Wed, Feb 22, 2017 at 7:32 AM, Shaohua Li <shli@fb.com> wrote:
> There are two issues, introduced by commit 8e58e32(md/raid1: use
> bio_clone_bioset_partial() in case of write behind):
> - bio_clone_bioset_partial() uses bytes instead of sectors as parameters
> - in writebehind mode, we return bio if all !writemostly disk bios finish,
>   which could happen before writemostly disk bios run. So all
>   writemostly disk bios should have their bvec. Here we just make sure
>   all bios are cloned instead of fast cloned.
>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid1.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index e1ee446..33526b4 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1472,8 +1472,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
>                             !waitqueue_active(&bitmap->behind_wait)) {
>                                 mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
>                                                                 mddev->bio_set,
> -                                                               offset,
> -                                                               max_sectors);
> +                                                               offset << 9,
> +                                                               max_sectors << 9);

This is my fault, :-(

>                                 alloc_behind_pages(mbio, r1_bio);
>                         }
>
> @@ -1485,8 +1485,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
>                 }
>
>                 if (!mbio) {
> -                       mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
> -                       bio_trim(mbio, offset, max_sectors);
> +                       if (r1_bio->behind_bvecs)
> +                               mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
> +                                                               mddev->bio_set,
> +                                                               offset << 9,
> +                                                               max_sectors << 9);
> +                       else {
> +                               mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
> +                               bio_trim(mbio, offset, max_sectors);
> +                       }

This is correct.

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming Lei
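
The unit mix-up fixed by the first hunk can be shown with a small
user-space sketch. It assumes 512-byte sectors, which is what the
"<< 9" in the patch implies; sectors_to_bytes(), struct byte_range and
clone_range() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, matching "<< 9" in the patch */

/* Convert a sector count or sector offset to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
	assert(sectors <= (UINT64_MAX >> SECTOR_SHIFT));
	return sectors << SECTOR_SHIFT;
}

/* Byte range handed to a clone helper that takes bytes, not sectors. */
struct byte_range {
	uint64_t offset;
	uint64_t len;
};

static struct byte_range clone_range(uint64_t offset_sectors,
				     uint64_t max_sectors)
{
	struct byte_range r;

	r.offset = sectors_to_bytes(offset_sectors);
	r.len = sectors_to_bytes(max_sectors);
	return r;
}
```

Passing the sector values through unconverted would describe a range
512 times smaller than intended, e.g. 8 bytes instead of 4096 for an
8-sector write.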

^ permalink raw reply

* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: George Rapp @ 2017-02-22  1:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux-RAID, Matthew Krumwiede, NeilBrown, Jes.Sorensen
In-Reply-To: <20170221175801.wt64t2tzcvg3sfmc@kernel.org>

> On Mon, Feb 20, 2017 at 05:18:46PM -0500, George Rapp wrote:
>> On Sat, Feb 11, 2017 at 7:32 PM, George Rapp <george.rapp@gmail.com> wrote:
>> [...snip...]
>>
>> When I try to assemble the RAID 5 array, though, the process gets
>> stuck at the location of the first bad block. The assemble command is:
>>
>> [...snip...]
>>
>> The md4_raid5 process immediately spikes to 100% CPU utilization, and
>> the reshape stops at 1901225472 KiB (which is exactly half of the
>> first bad sector value, 3802454640):
>>
> [...snip...]
On Tue, Feb 21, 2017 at 4:51 AM, Tomasz Majchrzak
<tomasz.majchrzak@intel.com> wrote:
> As long as you're sure the data on the disk is valid, I believe clearing
> bad block list manually in metadata (no easy way to do it) would allow
> reshape to complete.
>
> Tomek
On Tue, Feb 21, 2017 at 12:58 PM, Shaohua Li <shli@kernel.org> wrote:
>
> Add Neil and Jes.
>
> Yes, there were similar reports before. When reshape finds badblocks, the
> reshape will do an infinite loop without any progress. I think there are two
> things we need to do:
>
> - Make reshape more robust. Maybe reshape should bail out if badblocks found.
> - Add an option in mdadm to force reset badblocks

OK, I examined the structure of the superblock and the badblocks
array. My first attempt was to zero out the bblog_offset and
bblog_size in the md superblock using dd, but that causes the checksum
to differ from the sb_csum stored in the superblock, so mdadm
--assemble fails. I didn't want to research how to recalculate the
checksum unless I really, really have to.  8^)
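
For reference, a minimal sketch of how the checksum could be
recalculated, should that route ever be needed. It assumes the v1.x
superblock layout as I recall it from the kernel's md_p.h (sb_csum at
byte offset 216, max_dev at byte offset 220, 256 fixed bytes followed
by two bytes per dev_roles entry); please verify those offsets against
your own headers before relying on it. The function name
md_v1_sb_csum is hypothetical and not part of mdadm.

```python
import struct

# Assumed byte offsets within struct mdp_superblock_1 (verify locally).
SB_CSUM_OFFSET = 216
MAX_DEV_OFFSET = 220


def md_v1_sb_csum(sb: bytes) -> int:
    """Return the checksum over a v1.x superblock image (assumed layout)."""
    (max_dev,) = struct.unpack_from("<I", sb, MAX_DEV_OFFSET)
    size = 256 + 2 * max_dev  # fixed part plus 16-bit dev_roles entries
    if len(sb) < size:
        raise ValueError("superblock image is shorter than max_dev implies")

    buf = bytearray(sb[:size])
    # The stored checksum is treated as zero while summing.
    buf[SB_CSUM_OFFSET:SB_CSUM_OFFSET + 4] = b"\x00" * 4

    whole = size - size % 4
    total = 0
    for (word,) in struct.iter_unpack("<I", bytes(buf[:whole])):
        total += word
    if size % 4 == 2:
        # An odd number of dev_roles entries leaves one trailing 16-bit word.
        total += struct.unpack_from("<H", buf, whole)[0]

    # Fold the carry bits back in and truncate to 32 bits.
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF
```

The result would be written back little-endian at the sb_csum offset,
and only ever on an overlay, never on the real device.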

Running mdadm under gdb, I determined that my bblog_offset was 72
sectors from the start of the md superblock, and filled that space
with 0xff characters in my overlay file:

# dd if=/dev/mapper/sdg4 bs=512 count=1 skip=73 of=ffffffff
# dd if=ffffffff of=/dev/mapper/sdg4 bs=512 count=1 seek=72

That convinced mdadm that I have a badblocks list, but it's empty:

# mdadm --examine-badblocks /dev/mapper/sdg4
Bad-blocks on /dev/mapper/sdg4:
#
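
The empty listing is consistent with how the on-disk log appears to be
parsed. As far as I can tell from the md code, each entry is a
little-endian 64-bit value with the start sector in the upper 54 bits
and the length in the low 10 bits, and an all-ones entry ends the
list. A minimal sketch under those assumptions (parse_bblog and its
shift parameter, standing in for bblog_shift, are hypothetical names,
not part of mdadm):

```python
import struct


def parse_bblog(raw: bytes, shift: int = 0):
    """Decode bad block log entries as (start_sector, length) pairs."""
    entries = []
    usable = len(raw) - len(raw) % 8  # ignore any trailing partial entry
    for (bb,) in struct.iter_unpack("<Q", raw[:usable]):
        if bb == 0xFFFFFFFFFFFFFFFF:
            # An all-ones entry marks the end of the list.
            break
        sector = (bb >> 10) << shift
        length = (bb & 0x3FF) << shift
        entries.append((sector, length))
    return entries
```

A sector filled entirely with 0xff therefore decodes to an empty list,
because the very first entry is already the terminator.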

Once I did that, and restarted the array with my overlay files:

# mdadm --assemble --force /dev/md4
--backup-file=/home/gwr/2017/2017-01/md4_backup__2017-01-25
/dev/mapper/sde4 /dev/mapper/sdf4 /dev/mapper/sdh4 /dev/mapper/sdl4
/dev/mapper/sdg4 /dev/mapper/sdk4 /dev/mapper/sdi4 /dev/mapper/sdj4
/dev/mapper/sdb4
mdadm: accepting backup with timestamp 1485366772 for array with
timestamp 1487645030
mdadm: /dev/md4 has been started with 9 drives (out of 10).
#

The reshape operation got past the two positions where it had frozen
earlier, and didn't log any obvious errors to /var/log/messages, so
Tomek's suggestion to clear the badblocks seems to have worked.
However, this was in the overlay files, not the actual devices.

Before I proceed for real, does clearing the badblocks log and
assembling the array seem like my best option?

-- 
George Rapp  (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)

^ permalink raw reply
