* Re: [PATCH 1/1] dm raid: fix compat_features validation
From: Andy Whitcroft @ 2016-10-11 15:38 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Mike Snitzer, linux-kernel, linux-raid, dm-devel, Shaohua Li,
Alasdair Kergon
In-Reply-To: <1c517f14-1234-7844-fc6a-cd1b9698fb8b@redhat.com>
On Tue, Oct 11, 2016 at 05:04:34PM +0200, Heinz Mauelshagen wrote:
>
> Andy,
>
> good catch.
>
> We should rather check for V190 support only in case any
> compat feature flags are actually set.
>
> {
> + if (le32_to_cpu(sb->compat_features) &&
> + le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190)
> {
> rs->ti->error = "Unable to assemble array: Unknown flag(s)
> in compatible feature flags";
> return -EINVAL;
> }
If the feature flags are single bit combinations then I believe the
below does check exactly that. Checking for no 1s outside of the
expected features, caring not for the value of the valid bits:
+ if (le32_to_cpu(sb->compat_features) & ~(FEATURE_FLAG_SUPPORTS_V190)) {
with the possibility of ORing in additional feature bits as they are added.
-apw
* Re: [dm-devel] [PATCH 1/1] dm raid: fix compat_features validation
From: Heinz Mauelshagen @ 2016-10-11 15:44 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Mike Snitzer, linux-kernel, linux-raid, dm-devel, Shaohua Li,
Alasdair Kergon
In-Reply-To: <20161011153808.nmyf6hafjaadcemw@brain>
On 10/11/2016 05:38 PM, Andy Whitcroft wrote:
> On Tue, Oct 11, 2016 at 05:04:34PM +0200, Heinz Mauelshagen wrote:
>> Andy,
>>
>> good catch.
>>
>> We should rather check for V190 support only in case any
>> compat feature flags are actually set.
>>
>> {
>> + if (le32_to_cpu(sb->compat_features) &&
>> + le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190)
>> {
>> rs->ti->error = "Unable to assemble array: Unknown flag(s)
>> in compatible feature flags";
>> return -EINVAL;
>> }
> If the feature flags are single bit combinations then I believe the
> below does check exactly that. Checking for no 1s outside of the
> expected features, caring not for the value of the valid bits:
>
> + if (le32_to_cpu(sb->compat_features) & ~(FEATURE_FLAG_SUPPORTS_V190)) {
>
> with the possibilty to or in additional feature bits as they are added.
Thanks,
I prefer my version, as it is easier to read.
>
> -apw
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
* [PATCH 1/1 V2] dm raid: fix compat_features validation
From: Andy Whitcroft @ 2016-10-11 16:21 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: Mike Snitzer, linux-kernel, linux-raid, dm-devel, Shaohua Li,
Alasdair Kergon
In-Reply-To: <591b9d8d-2036-2d0f-14f2-af176b5beaea@redhat.com>
From a30fba068e41214cb0ffcb14e68722482765e0c9 Mon Sep 17 00:00:00 2001
From: Andy Whitcroft <apw@canonical.com>
Date: Tue, 11 Oct 2016 15:16:57 +0100
In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new
compatible feature flag was added. Validation for these compat_features
was added but this only passes for new raid mappings with this feature
flag. This causes previously created raid mappings to be failed at
import.
Check compat_features for the only valid combination.
Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support")
Signed-off-by: Andy Whitcroft <apw@canonical.com>
---
drivers/md/dm-raid.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
V2: simplify checks as per maintainer.
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 8abde6b..2a39700 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -2258,7 +2258,8 @@ static int super_validate(struct raid_set *rs, struct md_rdev *rdev)
if (!mddev->events && super_init_validation(rs, rdev))
return -EINVAL;
- if (le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190) {
+ if (le32_to_cpu(sb->compat_features) &&
+ le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190) {
rs->ti->error = "Unable to assemble array: Unknown flag(s) in compatible feature flags";
return -EINVAL;
}
--
2.9.3
* Re: [PATCH 1/1 V2] dm raid: fix compat_features validation
From: Heinz Mauelshagen @ 2016-10-11 16:53 UTC (permalink / raw)
To: Andy Whitcroft
Cc: Mike Snitzer, linux-kernel, linux-raid, dm-devel, Shaohua Li,
Alasdair Kergon
In-Reply-To: <20161011162148.hz2ncmbvtmtmzjkz@brain>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
On 10/11/2016 06:21 PM, Andy Whitcroft wrote:
> From a30fba068e41214cb0ffcb14e68722482765e0c9 Mon Sep 17 00:00:00 2001
> From: Andy Whitcroft <apw@canonical.com>
> Date: Tue, 11 Oct 2016 15:16:57 +0100
>
> In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new
> compatible feature flag was added. Validation for these compat_features
> was added but this only passes for new raid mappings with this feature
> flag. This causes previously created raid mappings to be failed at
> import.
Clarification:
the compat_features member has been in the dm-raid superblock from the
beginning (that is, before ecbfb9f118bce4) to allow for feature checks.
That commit renamed it from features to compat_features because it also
introduced incompat_features, together with the problematic check of
compat_features that you thankfully found.
>
> Check compat_features for the only valid combination.
>
> Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support")
> Signed-off-by: Andy Whitcroft <apw@canonical.com>
> ---
> drivers/md/dm-raid.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> V2: simplify checks as per maintainer.
>
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 8abde6b..2a39700 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -2258,7 +2258,8 @@ static int super_validate(struct raid_set *rs, struct md_rdev *rdev)
> if (!mddev->events && super_init_validation(rs, rdev))
> return -EINVAL;
>
> - if (le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190) {
> + if (le32_to_cpu(sb->compat_features) &&
> + le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190) {
> rs->ti->error = "Unable to assemble array: Unknown flag(s) in compatible feature flags";
> return -EINVAL;
> }
* growing a RAID-10 array with mdadm 3.3.1+ ?
From: moft @ 2016-10-11 17:26 UTC (permalink / raw)
To: linux-raid
Hi
I have a 4-disk RAID10 array
md0 : active raid10 sda1[4] sdb1[3] sdc1[2] sdd1[1]
1953259520 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
bitmap: 0/15 pages [0KB], 65536KB chunk
It was created with this command
mdadm --create /dev/md0 --level=raid10 --raid-devices=4 \
--name=md0 --homehost="<none>" \
--metadata=1.2 --bitmap=internal --layout=f2 --chunk=512 \
/dev/sd[abcd]1
It's running on a linux machine
uname -rm
4.8.1-2.g4861355-default x86_64
mdadm --version
mdadm - v3.3.1 - 5th June 2014
I need to add storage to the array.
I'd like to grow it by adding two disks (/dev/sd[ef]), to end up with a 6-disk array.
I know I can completely wipe it out and recreate it with 6-disks.
But I'd rather grow/extend it instead.
*CAN* I safely grow/expand it?
The ChangeLog for mdadm 3.3.1 says
Changes Prior to release 3.3
- Some array reshapes can proceed without needing backup file.
This is done by changing the 'data_offset' so we never need to write
any data back over where it was before. If there is no "head space"
or "tail space" to allow data_offset to change, the old mechanism
with a backup file can still be used.
- RAID10 arrays can be reshaped to change the number of devices,
change the chunk size, or change the layout between 'near'
and 'offset'.
This will always change data_offset, and will fail if there is no
room for data_offset to be moved.
So far I haven't found any specific "how to" for this process.
(1) The changelog refers to 'near' and 'offset' layouts, but doesn't mention 'far'.
CAN I safely grow this layout=f2 array ?
(2) If I can, what's the detailed procedure to do it?
Thanks
Mike
* Re: [PATCH 1/1] dm raid: fix compat_features validation
From: Mike Snitzer @ 2016-10-11 17:44 UTC (permalink / raw)
To: Heinz Mauelshagen
Cc: linux-kernel, linux-raid, dm-devel, Andy Whitcroft, Shaohua Li,
Alasdair Kergon
In-Reply-To: <591b9d8d-2036-2d0f-14f2-af176b5beaea@redhat.com>
On Tue, Oct 11 2016 at 11:44am -0400,
Heinz Mauelshagen <heinzm@redhat.com> wrote:
>
>
> On 10/11/2016 05:38 PM, Andy Whitcroft wrote:
> >On Tue, Oct 11, 2016 at 05:04:34PM +0200, Heinz Mauelshagen wrote:
> >>Andy,
> >>
> >>good catch.
> >>
> >>We should rather check for V190 support only in case any
> >>compat feature flags are actually set.
> >>
> >>{
> >>+ if (le32_to_cpu(sb->compat_features) &&
> >>+ le32_to_cpu(sb->compat_features) != FEATURE_FLAG_SUPPORTS_V190)
> >>{
> >> rs->ti->error = "Unable to assemble array: Unknown flag(s)
> >>in compatible feature flags";
> >> return -EINVAL;
> >> }
> >If the feature flags are single bit combinations then I believe the
> >below does check exactly that. Checking for no 1s outside of the
> >expected features, caring not for the value of the valid bits:
> >
> >+ if (le32_to_cpu(sb->compat_features) & ~(FEATURE_FLAG_SUPPORTS_V190)) {
> >
> >with the possibilty to or in additional feature bits as they are added.
>
> Thanks,
> I prefer this to be easier readable.
Readable or not, the code with the != is _not_ future-proof, whereas
Andy's solution is. If/when a new compat feature comes along,
FEATURE_FLAG_SUPPORTS_V190 would be replaced with a macro that ORs all
the compat features together (e.g. FEATURE_FLAG_COMPAT), as
dm-thin-metadata.c:__check_incompat_features() does.
We can go with the != code for now, since any future changes would
likely cause this test to be changed. Or we could fix it now _for
real_.
Mike
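For readers comparing the two checks discussed in this thread, here is a
minimal userspace sketch. The flag value and the FEATURE_FLAG_COMPAT_ALL
macro are assumptions made for illustration only; the real definitions
live in drivers/md/dm-raid.c and may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value, for illustration only. */
#define FEATURE_FLAG_SUPPORTS_V190 0x1u
/* Hypothetical aggregate: OR every known compat flag in here. */
#define FEATURE_FLAG_COMPAT_ALL (FEATURE_FLAG_SUPPORTS_V190)

/* Equality-style check: accepts only "no flags" or exactly V190. */
static bool compat_ok_equality(uint32_t features)
{
    return features == 0 || features == FEATURE_FLAG_SUPPORTS_V190;
}

/* Mask-style check: rejects only bits outside the known set. */
static bool compat_ok_mask(uint32_t features)
{
    return (features & ~FEATURE_FLAG_COMPAT_ALL) == 0;
}
```

With a single known flag both functions accept and reject the same
values. They start to differ once a second flag is added: the mask
version only needs the new bit ORed into the aggregate macro, while the
equality version has to be rewritten.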
* Re: growing a RAID-10 array with mdadm 3.3.1+ ?
From: Anthony Youngman @ 2016-10-11 18:29 UTC (permalink / raw)
To: moft, linux-raid
In-Reply-To: <1476206815.989242.752665505.6525EE00@webmail.messagingengine.com>
Okay, this is a first response, so you'll probably need more experienced
people to chime in, but
FIRST - BACKUP BACKUP!! BACKUP!!!!
Growing an array is pretty safe, but like anything here, it does have
its dangers.
Second, what distro are you running? Is it a systemd-based distro?
There are a few problems with resizing arrays at the moment, and my gut
feeling is that systemd is "to blame". It's very unlikely you'll lose
data, but you might well find the resize fails and you have to copy to a
new array anyway.
More notes inline ...
On 11/10/16 18:26, moft@fmailbox.com wrote:
> Hi
>
> I have a 4-disk RAID10 array
>
> md0 : active raid10 sda1[4] sdb1[3] sdc1[2] sdd1[1]
> 1953259520 blocks super 1.2 512K chunks 2 far-copies [4/4] [UUUU]
> bitmap: 0/15 pages [0KB], 65536KB chunk
>
> It was created with this command
>
> mdadm --create /dev/md0 --level=raid10 --raid-devices=4 \
> --name=md0 --homehost="<none>" \
> --metadata=1.2 --bitmap=internal --layout=f2 --chunk=512 \
> /dev/sd[abcd]1
>
> It's running on a linux machine
>
> uname -rm
> 4.8.1-2.g4861355-default x86_64
> mdadm --version
> mdadm - v3.3.1 - 5th June 2014
>
> I need to add storage to the array.
>
> I'd like to grow it by adding two disks (/dev/sd[ef]), to end up with a 6-disk array.
>
> I know I can completely wipe it out and recreate it with 6-disks.
>
> But I'd rather grow/extend it, Instead.
>
> *CAN* I safely grow/expand it?
Bugs excepted - yes you should be able to, without problems.
>
> The ChangeLog for mdadm 3.3.1 says
>
> Changes Prior to release 3.3
> - Some array reshapes can proceed without needing backup file.
> This is done by changing the 'data_offset' so we never need to write
> any data back over where it was before. If there is no "head space"
> or "tail space" to allow data_offset to change, the old mechanism
> with a backup file can still be used.
If you're growing the array, you shouldn't need a backup file. You might
need a backup for the first second or so, but then it's no longer
necessary. And mdadm can probably use the space in the two new disks to
store the backup data.
(What I understand happens, is that mdadm will read old stripes 1 & 2.
It then writes new stripe 1 and sets the watermark to stripe 1. That
says that the new array is complete up to 1, and if the data isn't
there, fetch it from the old array. It then reads old stripe 3 and
writes new stripe 2, then sets the watermark to 2. Old 4 & 5 become new
3, then old 6 makes new 4. Etc etc. Plus, of course, all the locking and
safeguards to make sure nothing reads the stripe that's actively being
updated ... :-)
Anyways, if it needs a backup file, it will tell you.
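The stripe bookkeeping described above can be sketched as follows. This
is only a toy model of plain striping, assuming an old layout with 2
data chunks per stripe and a new one with 3; it is not how md or mdadm
implement a reshape, it ignores RAID10 copies and layouts entirely, and
the names are made up for the example.

```c
/* Range of old stripes (1-based, inclusive) feeding one new stripe. */
struct stripe_range {
    unsigned first;
    unsigned last;
};

/*
 * Which old stripes hold the data chunks that end up in new_stripe?
 * old_width and new_width are data chunks per stripe.
 */
static struct stripe_range old_stripes_for(unsigned new_stripe,
                                           unsigned old_width,
                                           unsigned new_width)
{
    /* 0-based index of the first and last chunk in the new stripe */
    unsigned first_chunk = (new_stripe - 1) * new_width;
    unsigned last_chunk = new_stripe * new_width - 1;
    struct stripe_range r = {
        .first = first_chunk / old_width + 1,
        .last = last_chunk / old_width + 1,
    };
    return r;
}
```

With old_width = 2 and new_width = 3, new stripe 1 needs old stripes 1
and 2, new stripe 2 needs the rest of old 2 plus old 3, new stripe 3
needs old 4 and 5, and new stripe 4 needs the rest of old 5 plus old 6.
The reads always stay ahead of the writes, which is why a grow only
needs a backup at the very start.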
> - RAID10 arrays can be reshaped to change the number of devices,
> change the chunk size, or change the layout between 'near'
> and 'offset'.
> This will always change data_offset, and will fail if there is no
> room for data_offset to be moved.
>
> So far I haven't found any specific "how to" for this process.
mdadm /dev/md0 --add /dev/sde1 /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=6
The first command will add your two drives as spares. The second will
make them part of the array. It's the second command that's the risky
one... and bearing in mind I don't know raid10, it might just add them
on the end and not need any reconstruction at all ...
>
> (1) The changelog refers to 'near' and 'offset' layouts, but doesn't mention 'far'.
>
> CAN I safely grow this layout=f2 array ?
>
> (2) If I can, what's the detailed procedure to do it?
>
I'll be interested in knowing how this pans out, too, so I can add it to
the wiki :-)
Cheers,
Wol
* Re: growing a RAID-10 array with mdadm 3.3.1+ ?
From: moft @ 2016-10-11 18:37 UTC (permalink / raw)
To: Anthony Youngman, linux-raid
In-Reply-To: <a09a7ce1-8adf-e00c-7787-eb7ccc1a9e63@youngman.org.uk>
On Tue, Oct 11, 2016, at 11:29 AM, Anthony Youngman wrote:
> Growing an array is pretty safe, but like anything here, it does have its dangers.
>
> Second, what distro are you running? Is it a systemd-based distro?
Opensuse. Yes.
> feeling is that systemd is "to blame".
I have no idea why that'd be the case. That's the first time I've heard anybody suggest that.
> > *CAN* I safely grow/expand it?
>
> Bugs excepted - yes you should be able to, without problems.
So growing 'far' layouts is now supported? Do you have a reference/source for that?
> > This will always change data_offset, and will fail if there is no
> > room for data_offset to be moved.
So a 'fail' means -- just won't start? as opposed to 'oops, it's now broken'?
> > So far I haven't found any specific "how to" for this process.
>
> mdadm /dev/md0 --add /dev/sde1 /dev/sdf1
> mdadm --grow /dev/md0 --raid-devices=6
>
> The first command will add your two drives as spares. The second will
> make them part of the array. It's the second command that's the risky
> one... and bearing in mind I don't know raid10, it might just add them
> on the end and not need any reconstruction at all ...
Well, that's the missing critical detail here.
> > (1) The changelog refers to 'near' and 'offset' layouts, but doesn't mention 'far'.
> >
> > CAN I safely grow this layout=f2 array ?
> >
> > (2) If I can, what's the detailed procedure to do it?
Still need to understand the 'far' support, namely yes/no.
> I'll be interested in knowing how this pans out, too, so I can add it to
> the wiki :-)
Thanks
Mike
* Re: growing a RAID-10 array with mdadm 3.3.1+ ?
From: Anthony Youngman @ 2016-10-11 18:50 UTC (permalink / raw)
To: moft, linux-raid
In-Reply-To: <1476211037.1004490.752738017.68CF1E3E@webmail.messagingengine.com>
On 11/10/16 19:37, moft@fmailbox.com wrote:
>> feeling is that systemd is "to blame".
> I have no idea why that'd be the case. That's the first time I've heard anybody suggest that.
>
That's why it's a "gut feel" :-)
But the impression I'm getting is that when mdadm runs in the foreground
with root's permissions it runs fine. When it detects systemd and
backgrounds into daemon mode, something goes wrong.
But I repeat - this is just a gut feel. I could be completely wrong :-)
Cheers,
Wol
* Re: growing a RAID-10 array with mdadm 3.3.1+ ?
From: Phil Turmel @ 2016-10-11 20:26 UTC (permalink / raw)
To: moft, linux-raid
In-Reply-To: <1476206815.989242.752665505.6525EE00@webmail.messagingengine.com>
On 10/11/2016 01:26 PM, moft@fmailbox.com wrote:
> (1) The changelog refers to 'near' and 'offset' layouts, but doesn't mention 'far'.
Historically "far" has not been reshapeable at all. I don't recall
seeing a patch that implemented it. If you attempt it and it doesn't
support it, mdadm will refuse without hurting your array. Same is true
for other reasons to reject growing. mdadm gives an error before
touching the array.
You can get a definitive answer by setting up a set of small loop
devices in an array that mimics your setup and attempting to grow that
test array.
There have been bugs with SElinux and systemd preventing the reshape
task from forking properly from the command line tool. The array is
then stuck at reshape position 0.
Phil
* [PATCH] super1: make write_bitmap1 compatible with previous mdadm versions
From: Guoqing Jiang @ 2016-10-12 6:24 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, Guoqing Jiang, Neil Brown
Older mdadm versions use a different bitmap_offset for v1.x metadata,
so we can't assume that every bitmap sits on a 4K boundary: writing
4K for the bitmap could then corrupt the superblock. Anthony reported
this bug at the link below.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=837964
So check the alignment of bitmap_offset before setting the boundary
to 4096, instead of doing so unconditionally. Thanks to Neil for the
detailed explanation.
Reported-by: Anthony DeRobertis <anthony@derobert.net>
Fixes: 95a05b37e8eb ("Create n bitmaps for clustered mode")
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
super1.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/super1.c b/super1.c
index 9f62d23..4fef378 100644
--- a/super1.c
+++ b/super1.c
@@ -2433,7 +2433,15 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
memset(buf, 0xff, 4096);
memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
- towrite = calc_bitmap_size(bms, 4096);
+ /*
+ * use 4096 boundary if bitmap_offset is aligned
+ * with 8 sectors, then it should compatible with
+ * older mdadm.
+ */
+ if (__le32_to_cpu(sb->bitmap_offset) & 7)
+ towrite = calc_bitmap_size(bms, 512);
+ else
+ towrite = calc_bitmap_size(bms, 4096);
while (towrite > 0) {
n = towrite;
if (n > 4096)
--
2.6.6
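As a standalone illustration of the alignment test in the patch above,
the sketch below picks a write granularity from the bitmap offset. The
helper names are hypothetical and round_up_to() is only a stand-in; it
is not mdadm's calc_bitmap_size().

```c
#include <stdint.h>

/*
 * The offset is counted in 512-byte sectors. A 4096-byte boundary is
 * only safe when the offset is a multiple of 8 sectors.
 */
static unsigned bitmap_write_boundary(uint32_t bitmap_offset_sectors)
{
    return (bitmap_offset_sectors & 7) ? 512 : 4096;
}

/* Round a byte count up to the next multiple of boundary. */
static uint64_t round_up_to(uint64_t bytes, unsigned boundary)
{
    return (bytes + boundary - 1) / boundary * boundary;
}
```

An offset of 8 sectors therefore gets 4096-byte writes, while an offset
of 2 sectors, as older mdadm versions could produce, falls back to
512-byte granularity.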
* Re: [PATCH 24/54] md/raid1: Improve another size determination in setup_conf()
From: Dan Carpenter @ 2016-10-12 8:28 UTC (permalink / raw)
To: Jes Sorensen
Cc: Richard Weinberger, SF Markus Elfring, linux-raid@vger.kernel.org,
Christoph Hellwig, Guoqing Jiang, Jens Axboe, Mike Christie,
Neil Brown, Shaohua Li, Tomasz Majchrzak, LKML,
kernel-janitors@vger.kernel.org, Julia Lawall
In-Reply-To: <wrfjvax0bbb4.fsf@redhat.com>
Compare:
foo = kmalloc(sizeof(*foo), GFP_KERNEL);
This says you are allocating enough space for foo. It can be reviewed
by looking at one line. If you change the type of foo it will still
work.
foo = kmalloc(sizeof(struct whatever), GFP_KERNEL);
There isn't enough information to say if this is correct. If you change
the type of foo then you have to update the allocation as well.
It's not a super common type of bug, but I see it occasionally.
regards,
dan carpenter
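A userspace version of the first form, using malloc() in place of
kmalloc(). The struct widget type and widget_alloc() are made up for
the example.

```c
#include <stdlib.h>

/* Hypothetical type, used only for this example. */
struct widget {
    int id;
    double weight;
};

static struct widget *widget_alloc(void)
{
    /* sizeof(*w) follows the declared type of w automatically */
    struct widget *w = malloc(sizeof(*w));

    if (w) {
        w->id = 0;
        w->weight = 0.0;
    }
    return w;
}
```

If w is later changed to point at a different struct, the allocation
size follows without touching the malloc() line.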
* Re: [PATCH] badblocks: fix overlapping check for clearing
From: Tomasz Majchrzak @ 2016-10-12 10:26 UTC (permalink / raw)
To: Dan Williams, linux-block; +Cc: NeilBrown, linux-raid
In-Reply-To: <CAA9_cmcnQjKv=tdBF+Jkd6WoFh+aKCVzq7NPQBStwtcv_j0Qyg@mail.gmail.com>
On Mon, Oct 10, 2016 at 03:32:58PM -0700, Dan Williams wrote:
> > On Tue, Sep 06 2016, Tomasz Majchrzak wrote:
> >> ---
> >> block/badblocks.c | 6 ++++--
> >> 1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/badblocks.c b/block/badblocks.c
> >> index 7be53cb..b2ffcc7 100644
> >> --- a/block/badblocks.c
> >> +++ b/block/badblocks.c
> >> @@ -354,7 +354,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> >> * current range. Earlier ranges could also overlap,
> >> * but only this one can overlap the end of the range.
> >> */
> >> - if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
> >> + if ((BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) &&
> >> + (BB_OFFSET(p[lo]) <= target)) {
> >
> > hmmm..
> > 'target' is the sector just beyond the set of sectors to remove from the
> > list.
> > BB_OFFSET(p[lo]) is the first sector in a range that was found in the
> > list.
> > If these are equal, then are aren't clearing anything in this range.
> > So I would have '<', not '<='.
> >
> > I don't think this makes the code wrong as we end up assigning to p[lo]
> > the value that is already there. But it might be confusing.
> >
> >
> >> /* Partial overlap, leave the tail of this range */
> >> int ack = BB_ACK(p[lo]);
> >> sector_t a = BB_OFFSET(p[lo]);
> >> @@ -377,7 +378,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> >> lo--;
> >> }
> >> while (lo >= 0 &&
> >> - BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> >> + (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) &&
> >> + (BB_OFFSET(p[lo]) <= target)) {
> >
> > Ditto.
> >
> > But the code is, I think, correct. Just not how I would have written it.
> > So
> >
> > Acked-by: NeilBrown <neilb@suse.com>
>
> I agree with the comments to change "<=" to "<". Tomasz, care to
> re-send with those changes?
I have just resent the patch with your suggestions included.
> > In the original md context, it would only ever be called on a block that
> > was already in the list.
Actually, MD RAID10 does call it on blocks that are not in the list. See
handle_write_completed: it iterates over all copies and clears the bad block
if no error has been returned. I have a test case which fails for that reason:
an existing bad block is modified by the clear operation. It is very unlikely
to happen in real life, as it depends on a specific layout of bad blocks and
their discovery order, but it's a gap that needs to be closed.
I have put some effort into checking whether clearing a non-existing bad block
in RAID10 can lead to incorrect behaviour, but I haven't found any. It seems
that my patch is sufficient to fix the problem.
Tomek
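The two conditions in the patch, with the '<' suggested in the review,
amount to a half-open interval overlap test. The sketch below is a
standalone illustration; sector_t is a stand-in typedef and the helper
names are hypothetical, not taken from block/badblocks.c.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t; /* stand-in for the kernel type */

/* Does bad range [start, start + len) intersect cleared range [s, target)? */
static bool range_overlaps(sector_t start, sector_t len,
                           sector_t s, sector_t target)
{
    return start + len > s && start < target;
}

/* Does the bad range start before target and extend past it? */
static bool tail_survives(sector_t start, sector_t len, sector_t target)
{
    return start + len > target && start < target;
}
```

Without the second comparison, a bad range at sectors 100-109 would be
treated as overlapping a clear of sectors 50-59, which is the failure
described above.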
* Re: [PATCH 24/54] md/raid1: Improve another size determination in setup_conf()
From: Jes Sorensen @ 2016-10-12 12:18 UTC (permalink / raw)
To: Dan Carpenter
Cc: Richard Weinberger, SF Markus Elfring, linux-raid@vger.kernel.org,
Christoph Hellwig, Guoqing Jiang, Jens Axboe, Mike Christie,
Neil Brown, Shaohua Li, Tomasz Majchrzak, LKML,
kernel-janitors@vger.kernel.org, Julia Lawall
In-Reply-To: <20161012074307.GB5687@mwanda>
Dan Carpenter <dan.carpenter@oracle.com> writes:
> Compare:
>
> foo = kmalloc(sizeof(*foo), GFP_KERNEL);
>
> This says you are allocating enough space for foo. It can be reviewed
> by looking at one line. If you change the type of foo it will still
> work.
>
> foo = kmalloc(sizeof(struct whatever), GFP_KERNEL);
>
> There isn't enough information to say if this is correct. If you change
> the type of foo then you have to update the allocation as well.
>
> It's not a super common type of bug, but I see it occasionally.
I know what you are saying, but the latter in my book is easier to read
and reminds you what the type is when you review the code.
Point being, this comes down to personal preference, and stating that the
former is the right way, or making that a rule and using checkpatch to
harass people with patches to change it, is bogus.
Jes
* [PATCH] imsm: block chunk size change for RAID 10
From: Mariusz Dabrowski @ 2016-10-12 12:28 UTC (permalink / raw)
To: linux-raid
Cc: Jes.Sorensen, tomasz.majchrzak, aleksey.obitotskiy,
pawel.baldysiak, artur.paszkiewicz, maksymilian.kunt,
Mariusz Dabrowski
Changing the chunk size of a RAID 10 array fails because it is not supported,
but invalid values are still written to the metadata, so the array cannot be
assembled after it is stopped. Block the operation before the metadata update.
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
super-intel.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/super-intel.c b/super-intel.c
index 92817e9..0b3b2b1 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -10074,10 +10074,16 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
}
if ((geo->chunksize > 0) && (geo->chunksize != UnSet)
- && (geo->chunksize != info.array.chunk_size))
+ && (geo->chunksize != info.array.chunk_size)) {
+ if (info.array.level == 10) {
+ pr_err("Error. Chunk size change for RAID 10 is not supported.\n");
+ change = -1;
+ goto analyse_change_exit;
+ }
change = CH_MIGRATION;
- else
+ } else {
geo->chunksize = info.array.chunk_size;
+ }
chunk = geo->chunksize / 1024;
--
1.8.3.1
* [PATCH] Allow level migration only for single-array container
From: Mariusz Dabrowski @ 2016-10-12 12:29 UTC (permalink / raw)
To: linux-raid
Cc: Jes.Sorensen, tomasz.majchrzak, aleksey.obitotskiy,
pawel.baldysiak, artur.paszkiewicz, maksymilian.kunt,
Mariusz Dabrowski
IMSM doesn't allow changing the RAID level of an array in a container that
holds two arrays, but the array count check is done too late (after disks have
been removed), and in some cases (e.g. RAID 0 and RAID 1 migrated to RAID 0)
both arrays become degraded. This patch adds the array count check before the
disks are removed.
Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
---
Grow.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Grow.c b/Grow.c
index 628f0e7..bcd27f5 100755
--- a/Grow.c
+++ b/Grow.c
@@ -777,6 +777,25 @@ int remove_disks_for_takeover(struct supertype *st,
struct mdinfo *remaining;
int slot;
+ if (st->ss->external) {
+ int rv = 0;
+ struct mdinfo *arrays = st->ss->container_content(st, NULL);
+ /* containter_content returns list of arrays in container
+ * If arrays->next is not NULL it means that there are 2 arrays in
+ * container and operation should be blocked
+ */
+ if (arrays) {
+ if (arrays->next)
+ rv = 1;
+ sysfs_free(arrays);
+ if (rv) {
+ pr_err("Error. Cannot perform operation on /dev/%s\n", st->devnm);
+ pr_err("For this operation it MUST be single array in container\n");
+ return rv;
+ }
+ }
+ }
+
if (sra->array.level == 10)
nr_of_copies = layout & 0xff;
else if (sra->array.level == 1)
--
1.8.3.1
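The check added above boils down to asking whether the list returned
for the container has more than one element, before anything
destructive happens. A minimal sketch, with a hypothetical struct
standing in for the mdinfo list:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the list of arrays in a container. */
struct array_info {
    struct array_info *next;
};

/* True when the container holds two or more arrays. */
static bool container_has_multiple_arrays(const struct array_info *arrays)
{
    return arrays != NULL && arrays->next != NULL;
}
```

As in the patch, an empty list is not treated as an error here; only a
second entry blocks the operation.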
* Re: [PATCH v4 5/8] md/r5cache: reclaim support
From: Shaohua Li @ 2016-10-12 16:50 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161011002446.2002428-6-songliubraving@fb.com>
On Mon, Oct 10, 2016 at 05:24:43PM -0700, Song Liu wrote:
> There are two limited resources, stripe cache and journal disk space.
> For better performance, we priotize reclaim of full stripe writes.
> To free up more journal space, we free earliest data on the journal.
>
> In current implementation, reclaim happens when:
> 1. every R5C_RECLAIM_WAKEUP_INTERVAL (5 seconds)
This is the safety net: trigger it only if no reclaim has run in the last
5 seconds. Otherwise, skip it.
> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (8) cached full stripes
> (r5c_check_cached_full_stripe)
8 stripes isn't much. I'd use 128k, so 32 stripes. 128k-sized IO should be
good for hard disks.
I'd suggest something like this:
1. if there is no stripe cache pressure and there are 32+ full stripes, flush
all full stripes
2. if stripe cache pressure is moderate, flush all full stripes
3. if stripe cache pressure is high, flush all full stripes first. If pressure
is still higher than a watermark, flush partial stripes.
The principle is to flush full stripes if possible and to flush as many as
possible. Reclaim does a disk cache flush, so it is an expensive operation.
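The three-step policy suggested above could be sketched like this. The
enum names, the pressure levels and the function are hypothetical; how
pressure is actually measured is left to the caller and is not part of
the suggestion.

```c
/* Hypothetical pressure levels for the stripe cache. */
enum cache_pressure { PRESSURE_NONE, PRESSURE_MODERATE, PRESSURE_HIGH };

/* What the reclaim path should do next. */
enum flush_action { FLUSH_NOTHING, FLUSH_FULL, FLUSH_FULL_THEN_PARTIAL };

/* 32 full stripes, i.e. the 128k batch suggested above. */
#define FULL_STRIPE_BATCH 32

static enum flush_action pick_flush_action(enum cache_pressure pressure,
                                           unsigned full_stripes)
{
    switch (pressure) {
    case PRESSURE_NONE:
        /* Only worth a flush once a whole batch has accumulated. */
        return full_stripes >= FULL_STRIPE_BATCH ? FLUSH_FULL
                                                 : FLUSH_NOTHING;
    case PRESSURE_MODERATE:
        return FLUSH_FULL;
    case PRESSURE_HIGH:
        /*
         * Full stripes go first; the caller re-checks the watermark
         * before it touches partial stripes.
         */
        return FLUSH_FULL_THEN_PARTIAL;
    }
    return FLUSH_NOTHING;
}
```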
> 3. when raid5_get_active_stripe sees pressure in stripe cache space
> (r5c_check_stripe_cache_usage)
> 4. when there is pressure in journal space.
>
> 1-3 above are straightforward. The following explains details of 4.
>
> To avoid deadlock due to log space, we need to reserve enough space
> to flush cached data. The size of required log space depends on total
> number of cached stripes (stripe_in_cache_count). In current
> implementation, the reclaim path automatically include pending
> data writes with parity writes (similar to write through case).
> Therefore, we need up to (conf->raid_disks + 1) pages for each cached
> stripe (1 page for meta data, raid_disks pages for all data and
> parity). r5c_log_required_to_flush_cache() calculates log space
> required to flush cache. In the following, we refer to the space
> calculated by r5c_log_required_to_flush_cache() as
> reclaim_required_space.
>
> Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
> R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
> device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
> is set when free space on the log device is less than 2x of
> reclaim_required_space.
>
> r5c_cache keeps all data in cache (not fully committed to RAID) in
> a list (stripe_in_cache_list). These stripes are in the order of their
> first appearance on the journal. So the log tail (last_checkpoint)
> should point to the journal_start of the first item in the list.
>
> When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
> stripes at the head of stripe_in_cache. When R5C_LOG_CRITICAL is
> set, the state machine only writes data that are already in the
> log device (in stripe_in_cache_list).
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 363 +++++++++++++++++++++++++++++++++++++++++++----
> drivers/md/raid5.c | 17 +++
> drivers/md/raid5.h | 39 +++--
> 3 files changed, 383 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 92d3d7b..2774f93 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -29,12 +29,21 @@
> #define BLOCK_SECTORS (8)
>
> /*
> - * reclaim runs every 1/4 disk size or 10G reclaimable space. This can prevent
> - * recovery scans a very long log
> + * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
> + *
> + * In write through mode, the reclaim runs every log->max_free_space.
> + * This can prevent the recovery scans for too long
> */
> #define RECLAIM_MAX_FREE_SPACE (10 * 1024 * 1024 * 2) /* sector */
> #define RECLAIM_MAX_FREE_SPACE_SHIFT (2)
>
> +/* wake up reclaim thread periodically */
> +#define R5C_RECLAIM_WAKEUP_INTERVAL (5 * HZ)
> +/* start flush with these full stripes */
> +#define R5C_FULL_STRIPE_FLUSH_BATCH 8
> +/* reclaim stripes in groups */
> +#define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)
> +
> /*
> * We only need 2 bios per I/O unit to make progress, but ensure we
> * have a few more available to not get too tight.
> @@ -141,6 +150,11 @@ struct r5l_log {
>
> /* for r5c_cache */
> enum r5c_state r5c_state;
> + struct list_head stripe_in_cache_list; /* all stripes in r5cache, with
> + * sh->log_start in order
> + */
The comment isn't correct: log_start could wrap. The stripes are actually
ordered by seq. Please move the comment above stripe_in_cache_list.
> +/*
> + * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
> + * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
> + *
> + * must hold conf->device_lock
> + */
> +static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
> +{
> + BUG_ON(list_empty(&sh->lru));
> + BUG_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
> + BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
> + assert_spin_locked(&conf->device_lock);
> +
> + list_del_init(&sh->lru);
> + atomic_inc(&sh->count);
> +
> + set_bit(STRIPE_HANDLE, &sh->state);
> + atomic_inc(&conf->active_stripes);
> + r5c_freeze_stripe_for_reclaim(sh);
> +
> + set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
We increase preread_active_stripes too if STRIPE_PREREAD_ACTIVE isn't already set.
Why not here?
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH v4 6/8] md/r5cache: sysfs entry r5c_state
From: Shaohua Li @ 2016-10-12 16:56 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161011002446.2002428-7-songliubraving@fb.com>
On Mon, Oct 10, 2016 at 05:24:44PM -0700, Song Liu wrote:
> r5c_state have 4 states:
> * no-cache;
> * write-through (write journal only);
> * write-back (w/ write cache);
> * cache-broken (journal missing or Faulty)
>
> When there is functional write cache, r5c_state is a knob to
> switch between write-back and write-through.
>
> When the journal device is broken, the raid array is forced
> in readonly mode. In this case, r5c_state can be used to
> remove "journal feature", and thus make the array read-write
> without journal. By writing into r5c_cache_mode, the array
> can transit from cache-broken to no-cache, which removes
> journal feature for the array.
>
> To remove the journal feature:
> - When journal fails, the raid array is forced readonly mode
> (enforced by kernel)
> - User uses the new interface to remove journal (writing 0
> to r5c_state, I will add a mdadm option for that later)
> - User forces array read-write;
> - Kernel updates superblock and array can run read/write.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/md/raid5.c | 1 +
> drivers/md/raid5.h | 1 +
> 3 files changed, 60 insertions(+)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 2774f93..b19024c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -57,6 +57,8 @@ enum r5c_state {
> R5C_STATE_CACHE_BROKEN = 3,
> };
>
> +static char *r5c_state_str[] = {"no-cache", "write-through",
> + "write-back", "cache-broken"};
> /*
> * raid5 cache state machine
> *
> @@ -1519,6 +1521,62 @@ int r5c_flush_cache(struct r5conf *conf, int num)
> return count;
> }
>
> +ssize_t r5c_state_show(struct mddev *mddev, char *page)
> +{
> + struct r5conf *conf = mddev->private;
> + int val = 0;
> + int ret = 0;
> +
> + if (conf->log)
> + val = conf->log->r5c_state;
> + else if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
> + val = R5C_STATE_CACHE_BROKEN;
> + ret += snprintf(page, PAGE_SIZE - ret, "%d: %s\n",
> + val, r5c_state_str[val]);
There is no point in doing PAGE_SIZE - ret; ret is still 0 here.
This isn't how sysfs entry is supposed to output. You can either show the value
or the string, not both with format. I'd prefer the string though, and make
store accept string.
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH v4 0/8] raid5-cache: enabling cache features
From: Shaohua Li @ 2016-10-12 17:52 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161011002446.2002428-1-songliubraving@fb.com>
On Mon, Oct 10, 2016 at 05:24:38PM -0700, Song Liu wrote:
> These are the 4th version of patches to enable write cache part of
> raid5-cache. The journal part was released with kernel 4.4.
>
> The caching part uses same disk format of raid456 journal, and provides
> acceleration to writes. Write operations are committed (bio_endio) once
> the data is secured in journal. Reconstruct and RMW are postponed to
> reclaim path, which is (hopefully) not on the critical path.
>
> The changes are organized in 8 patches (details below).
>
> Patch for chunk_aligned_read in earlier RFC is not included yet
> (http://marc.info/?l=linux-raid&m=146432700719277). But we may still need
> some optimizations later, especially for SSD raid devices.
>
> Changes from PATCH v3 (http://marc.info/?l=linux-raid&m=147573807306070):
> 1. Make reclaim robust
> 2. Fix a bug in recovery
>
> Changes between v3 and v2 (http://marc.info/?l=linux-raid&m=147493266208102):
> 1. Incorporate feedback from Shaohua
> 2. Reorganize the patches, for hopefully easier review
> 3. Make sure no change to write through mode (journal only)
> 4. Change reclaim design to avoid deadlock due to log space
Could you please add a test case for this in mdadm test suites?
Thanks,
Shaohua
^ permalink raw reply
* Re: [PATCH v4 6/8] md/r5cache: sysfs entry r5c_state
From: Song Liu @ 2016-10-12 21:23 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-raid@vger.kernel.org, neilb@suse.com, Shaohua Li,
Kernel Team, dan.j.williams@intel.com, hch@infradead.org,
liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <20161012165647.GB15323@kernel.org>
> On Oct 12, 2016, at 9:56 AM, Shaohua Li <shli@kernel.org> wrote:
>
>>
>
> This isn't how sysfs entry is supposed to output. You can either show the value
> or the string, not both with format. I'd prefer the string though, and make
> store accept string.
I implemented something like the following:
root@virt-test:~/md# cat /sys/block/md0/md/r5c_state
[write-through] write-back
root@virt-test:~/md# echo write-back > /sys/block/md0/md/r5c_state
root@virt-test:~/md# cat /sys/block/md0/md/r5c_state
write-through [write-back]
root@virt-test:~/md# ./mdadm --fail /dev/md0 /dev/loop4
mdadm: set /dev/loop4 faulty in /dev/md0
root@virt-test:~/md# cat /sys/block/md0/md/r5c_state
no-cache [cache-broken]
root@virt-test:~/md# echo no-cache > /sys/block/md0/md/r5c_state
root@virt-test:~/md# cat /sys/block/md0/md/r5c_state
no-cache
It will be part of next version.
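For reference, the bracketed format shown above can be sketched in plain userspace C as follows. This is only an illustration under assumptions, not the code that will be in the next version: mode_names, mode_show and mode_parse are hypothetical names, only two of the modes are listed, and a real sysfs show/store pair would work on the page buffer handed in by the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the modes; the index doubles as the mode value. */
static const char *const mode_names[] = { "write-through", "write-back" };
#define NR_MODES (sizeof(mode_names) / sizeof(mode_names[0]))

/*
 * "show": list every mode on one line and bracket the current one.
 * Returns the number of bytes written, or -1 if buf is too small.
 */
static int mode_show(char *buf, size_t size, size_t cur)
{
	size_t len = 0;
	size_t i;

	assert(cur < NR_MODES);
	for (i = 0; i < NR_MODES; i++) {
		const char *sep = i ? " " : "";
		int n;

		if (i == cur)
			n = snprintf(buf + len, size - len, "%s[%s]",
				     sep, mode_names[i]);
		else
			n = snprintf(buf + len, size - len, "%s%s",
				     sep, mode_names[i]);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	if (len + 2 > size)	/* room for '\n' and the terminator */
		return -1;
	buf[len++] = '\n';
	buf[len] = '\0';
	return (int)len;
}

/*
 * "store": map a written string back to a mode index. One trailing
 * newline (as produced by echo) is accepted. Returns -1 if unknown.
 */
static int mode_parse(const char *buf)
{
	size_t len = strlen(buf);
	size_t i;

	if (len && buf[len - 1] == '\n')
		len--;
	for (i = 0; i < NR_MODES; i++)
		if (strlen(mode_names[i]) == len &&
		    !strncmp(buf, mode_names[i], len))
			return (int)i;
	return -1;
}
```

Matching on the full name (rather than a prefix) keeps "write" from being accepted as either mode.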
Thanks,
Song
^ permalink raw reply
* [PATCH v5 0/8] raid5-cache: enabling cache features
From: Song Liu @ 2016-10-13 5:49 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
This is the 5th version of the patches to enable the write cache part of
raid5-cache. The journal part was released with kernel 4.4.
The caching part uses the same disk format as the raid456 journal, and provides
acceleration to writes. Write operations are committed (bio_endio) once
the data is secured in the journal. Reconstruct and RMW are postponed to
the reclaim path, which is (hopefully) not on the critical path.
The changes are organized in 8 patches (details below).
The patch for chunk_aligned_read from the earlier RFC is not included yet
(http://marc.info/?l=linux-raid&m=146432700719277). But we may still need
some optimizations later, especially for SSD raid devices.
Changes between v5 and v4 (http://marc.info/?l=linux-raid&m=147629531615172):
1. Change the output/input of sysfs entry r5c_state
2. Move heavy reclaim work from raid5_make_request() to r5c_do_reclaim()
3. Fix an issue with orig_page handling in the write path
Changes between v4 and v3 (http://marc.info/?l=linux-raid&m=147573807306070):
1. Make reclaim robust
2. Fix a bug in recovery
Changes between v3 and v2 (http://marc.info/?l=linux-raid&m=147493266208102):
1. Incorporate feedback from Shaohua
2. Reorganize the patches, for hopefully easier review
3. Make sure no change to write through mode (journal only)
4. Change reclaim design to avoid deadlock due to log space
Thanks,
Song
Song Liu (8):
md/r5cache: Check array size in r5l_init_log
md/r5cache: move some code to raid5.h
md/r5cache: State machine for raid5-cache write back mode
md/r5cache: write part of r5cache
md/r5cache: reclaim support
md/r5cache: sysfs entry r5c_state
md/r5cache: r5c recovery
md/r5cache: handle SYNC and FUA
drivers/md/raid5-cache.c | 1659 ++++++++++++++++++++++++++++++++++++++++------
drivers/md/raid5.c | 261 +++++---
drivers/md/raid5.h | 150 ++++-
3 files changed, 1772 insertions(+), 298 deletions(-)
--
2.9.3
^ permalink raw reply
* [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log
From: Song Liu @ 2016-10-13 5:49 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-1-songliubraving@fb.com>
Currently, r5l_write_stripe checks the meta size for each stripe write,
which is not necessary.
With this patch, r5l_init_log checks the maximal meta size of the array,
which is r5l_meta_block + raid_disks x (r5l_payload_data_parity + one
__le32 checksum).
If this is too big to fit in one page, r5l_init_log aborts.
With the current metadata format, r5l_log supports raid_disks up to 203.
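The 203 limit can be checked with a small userspace calculation. The byte sizes below are assumptions (32 bytes for struct r5l_meta_block, 16 bytes for struct r5l_payload_data_parity without its checksum array, 4 bytes per __le32 checksum); they are not stated in this patch, and the helper names are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes in bytes; not taken from this patch. */
enum {
	PAGE_BYTES	= 4096,	/* r5l_init_log requires PAGE_SIZE == 4096 */
	META_BLOCK	= 32,	/* assumed sizeof(struct r5l_meta_block) */
	PAYLOAD_HDR	= 16,	/* assumed sizeof(struct r5l_payload_data_parity) */
	CHECKSUM	= 4,	/* sizeof(__le32), one checksum per page */
};

/* Worst-case meta size: one payload plus one checksum per disk. */
static size_t max_meta_size(unsigned int raid_disks)
{
	return META_BLOCK + (size_t)(PAYLOAD_HDR + CHECKSUM) * raid_disks;
}

/* Largest raid_disks whose worst-case meta still fits in one page. */
static unsigned int max_raid_disks(void)
{
	return (PAGE_BYTES - META_BLOCK) / (PAYLOAD_HDR + CHECKSUM);
}

/* Sanity check of the arithmetic: 203 disks fit, 204 do not. */
static void check_limit(void)
{
	assert(max_meta_size(203) <= PAGE_BYTES);	/* 32 + 203 * 20 = 4092 */
	assert(max_meta_size(204) > PAGE_BYTES);	/* 32 + 204 * 20 = 4112 */
}
```

Under those assumed sizes, (4096 - 32) / 20 rounds down to 203, which matches the limit quoted above.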
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1b1ab4a..7557791b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -441,7 +441,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
{
int write_disks = 0;
int data_pages, parity_pages;
- int meta_size;
int reserve;
int i;
int ret = 0;
@@ -473,15 +472,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
parity_pages = 1 + !!(sh->qd_idx >= 0);
data_pages = write_disks - parity_pages;
- meta_size =
- ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
- * data_pages) +
- sizeof(struct r5l_payload_data_parity) +
- sizeof(__le32) * parity_pages;
- /* Doesn't work with very big raid array */
- if (meta_size + sizeof(struct r5l_meta_block) > PAGE_SIZE)
- return -EINVAL;
-
set_bit(STRIPE_LOG_TRAPPED, &sh->state);
/*
* The stripe must enter state machine again to finish the write, so
@@ -1184,6 +1174,22 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
if (PAGE_SIZE != 4096)
return -EINVAL;
+
+ /*
+ * The PAGE_SIZE must be big enough to hold 1 r5l_meta_block and
+ * raid_disks r5l_payload_data_parity.
+ *
+ * Write journal and cache does not work for very big array
+ * (raid_disks > 203)
+ */
+ if (sizeof(struct r5l_meta_block) +
+ ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32)) *
+ conf->raid_disks) > PAGE_SIZE) {
+ pr_err("md/raid:%s: write journal/cache doesn't work for array with %d disks\n",
+ mdname(conf->mddev), conf->raid_disks);
+ return -EINVAL;
+ }
+
log = kzalloc(sizeof(*log), GFP_KERNEL);
if (!log)
return -ENOMEM;
--
2.9.3
^ permalink raw reply related
* [PATCH v5 2/8] md/r5cache: move some code to raid5.h
From: Song Liu @ 2016-10-13 5:49 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-1-songliubraving@fb.com>
Move some defines and inline functions to raid5.h, so they can be
used in raid5-cache.c.
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5.c | 71 -------------------------------------------------
drivers/md/raid5.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 77 insertions(+), 71 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f94472d..67d4f49 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -70,19 +70,6 @@ module_param(devices_handle_discard_safely, bool, 0644);
MODULE_PARM_DESC(devices_handle_discard_safely,
"Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
static struct workqueue_struct *raid5_wq;
-/*
- * Stripe cache
- */
-
-#define NR_STRIPES 256
-#define STRIPE_SIZE PAGE_SIZE
-#define STRIPE_SHIFT (PAGE_SHIFT - 9)
-#define STRIPE_SECTORS (STRIPE_SIZE>>9)
-#define IO_THRESHOLD 1
-#define BYPASS_THRESHOLD 1
-#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
-#define HASH_MASK (NR_HASH - 1)
-#define MAX_STRIPE_BATCH 8
static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
{
@@ -126,64 +113,6 @@ static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
local_irq_enable();
}
-/* bio's attached to a stripe+device for I/O are linked together in bi_sector
- * order without overlap. There may be several bio's per stripe+device, and
- * a bio could span several devices.
- * When walking this list for a particular stripe+device, we must never proceed
- * beyond a bio that extends past this device, as the next bio might no longer
- * be valid.
- * This function is used to determine the 'next' bio in the list, given the sector
- * of the current stripe+device
- */
-static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
-{
- int sectors = bio_sectors(bio);
- if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
- return bio->bi_next;
- else
- return NULL;
-}
-
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
- atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
- return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
- atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
- return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
- atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
- atomic_inc(segments);
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
- unsigned int cnt)
-{
- atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
- int old, new;
-
- do {
- old = atomic_read(segments);
- new = (old & 0xffff) | (cnt << 16);
- } while (atomic_cmpxchg(segments, old, new) != old);
-}
-
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
- atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
- atomic_set(segments, cnt);
-}
-
/* Find first data disk in a raid6 stripe */
static inline int raid6_d0(struct stripe_head *sh)
{
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 517d4b6..46cfe93 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -410,6 +410,83 @@ struct disk_info {
struct md_rdev *rdev, *replacement;
};
+/*
+ * Stripe cache
+ */
+
+#define NR_STRIPES 256
+#define STRIPE_SIZE PAGE_SIZE
+#define STRIPE_SHIFT (PAGE_SHIFT - 9)
+#define STRIPE_SECTORS (STRIPE_SIZE>>9)
+#define IO_THRESHOLD 1
+#define BYPASS_THRESHOLD 1
+#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
+#define HASH_MASK (NR_HASH - 1)
+#define MAX_STRIPE_BATCH 8
+
+/* bio's attached to a stripe+device for I/O are linked together in bi_sector
+ * order without overlap. There may be several bio's per stripe+device, and
+ * a bio could span several devices.
+ * When walking this list for a particular stripe+device, we must never proceed
+ * beyond a bio that extends past this device, as the next bio might no longer
+ * be valid.
+ * This function is used to determine the 'next' bio in the list, given the
+ * sector of the current stripe+device
+ */
+static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
+{
+ int sectors = bio_sectors(bio);
+
+ if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
+ return bio->bi_next;
+ else
+ return NULL;
+}
+
+/*
+ * We maintain a biased count of active stripes in the bottom 16 bits of
+ * bi_phys_segments, and a count of processed stripes in the upper 16 bits
+ */
+static inline int raid5_bi_processed_stripes(struct bio *bio)
+{
+ atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+ return (atomic_read(segments) >> 16) & 0xffff;
+}
+
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
+{
+ atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+ return atomic_sub_return(1, segments) & 0xffff;
+}
+
+static inline void raid5_inc_bi_active_stripes(struct bio *bio)
+{
+ atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+ atomic_inc(segments);
+}
+
+static inline void raid5_set_bi_processed_stripes(struct bio *bio,
+ unsigned int cnt)
+{
+ atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+ int old, new;
+
+ do {
+ old = atomic_read(segments);
+ new = (old & 0xffff) | (cnt << 16);
+ } while (atomic_cmpxchg(segments, old, new) != old);
+}
+
+static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
+{
+ atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+ atomic_set(segments, cnt);
+}
+
/* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
* This is because we sometimes take all the spinlocks
* and creating that much locking depth can cause
--
2.9.3
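For readers unfamiliar with the bi_phys_segments helpers moved by this patch, the packing they implement (active stripes in the low 16 bits, processed stripes in the high 16 bits) can be sketched in userspace with C11 atomics. This is an illustration only: it uses <stdatomic.h> instead of the kernel's atomic_t API, and the function names are shortened, hypothetical versions of the kernel ones.

```c
#include <assert.h>
#include <stdatomic.h>

/* Low 16 bits: (biased) active stripes. High 16 bits: processed stripes. */

static int processed_stripes(atomic_int *segments)
{
	return (atomic_load(segments) >> 16) & 0xffff;
}

/* Drop one active stripe and return the remaining active count. */
static int dec_active_stripes(atomic_int *segments)
{
	/* atomic_fetch_sub returns the old value, so subtract 1 again */
	return (atomic_fetch_sub(segments, 1) - 1) & 0xffff;
}

static void inc_active_stripes(atomic_int *segments)
{
	atomic_fetch_add(segments, 1);
}

/* Replace the high half while leaving the active count untouched. */
static void set_processed_stripes(atomic_int *segments, unsigned int cnt)
{
	int old = atomic_load(segments);
	int updated;

	assert(cnt <= 0xffff);
	do {
		updated = (old & 0xffff) | (int)(cnt << 16);
		/* on failure, old is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(segments, &old, updated));
}
```

The compare-and-exchange loop is what lets the processed count change without losing a concurrent increment or decrement of the active count.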
^ permalink raw reply related
* [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode
From: Song Liu @ 2016-10-13 5:49 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-1-songliubraving@fb.com>
The raid5-cache write back mode has 2 states for each stripe: write state
and reclaim state. This patch adds the bare state machine for these two
states.
2 flags are added to sh->state for raid5-cache states:
- STRIPE_R5C_FROZEN
- STRIPE_R5C_WRITTEN
STRIPE_R5C_FROZEN is the key flag to differentiate write state
and reclaim state.
STRIPE_R5C_WRITTEN is a helper flag to bring the stripe from
reclaim state back to write state.
In write through mode, every stripe also goes between write state
and reclaim state (in r5c_handle_stripe_dirtying() and
r5c_handle_stripe_written()).
Please note: this is a "no-op" patch for raid5-cache write through
mode.
The following detailed explanation is copied from the raid5-cache.c:
/*
* raid5 cache state machine
*
* The RAID cache works in two major states for each stripe: write state
* and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
* and STRIPE_R5C_WRITTEN
*
* STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
* state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
* state runs w/ STRIPE_R5C_FROZEN == 1.
*
* STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
* state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
* process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
* and parity of a stripe is all in journal device; and cleared when the data
* and parity are all in RAID disks.
*
* The following is another way to show how STRIPE_R5C_FROZEN and
* STRIPE_R5C_WRITTEN work:
*
* write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
* reclaim state: STRIPE_R5C_FROZEN = 1
*
* write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
* reclaim => write:
* 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
* 2. write data/parity to raid disks, when finished, clear both
* STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
*
* In write through mode (journal only) the stripe still goes through these
* state changes, except that STRIPE_R5C_FROZEN is set on write in
* r5c_handle_stripe_dirtying().
*/
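The transitions described in the comment above can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: the flags are plain bits in an unsigned int rather than sh->state bit operations, and struct stripe and the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	R5C_FROZEN  = 1 << 0,	/* stands in for STRIPE_R5C_FROZEN */
	R5C_WRITTEN = 1 << 1,	/* stands in for STRIPE_R5C_WRITTEN */
};

/* Hypothetical, minimal stand-in for struct stripe_head. */
struct stripe {
	unsigned int state;
};

static bool in_reclaim_state(const struct stripe *sh)
{
	return (sh->state & R5C_FROZEN) != 0;
}

/* write => reclaim: freeze the stripe. */
static void freeze_for_reclaim(struct stripe *sh)
{
	assert(!(sh->state & R5C_FROZEN));
	sh->state |= R5C_FROZEN;
}

/* reclaim step 1: parity has reached the journal. */
static void parity_journaled(struct stripe *sh)
{
	assert(sh->state & R5C_FROZEN);
	sh->state |= R5C_WRITTEN;
}

/*
 * reclaim step 2: data/parity have reached the RAID disks, so clear both
 * flags and return to write state. Does nothing unless step 1 happened.
 */
static void written_to_raid(struct stripe *sh)
{
	if (!(sh->state & R5C_WRITTEN))
		return;
	sh->state &= ~(unsigned int)(R5C_FROZEN | R5C_WRITTEN);
}
```

The early return in written_to_raid mirrors how the cleanup only runs for stripes that have STRIPE_R5C_WRITTEN set.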
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 125 +++++++++++++++++++++++++++++++++++++++++++++--
drivers/md/raid5.c | 20 ++++++--
drivers/md/raid5.h | 10 +++-
3 files changed, 148 insertions(+), 7 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 7557791b..9e05850 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -40,6 +40,47 @@
*/
#define R5L_POOL_SIZE 4
+enum r5c_state {
+ R5C_STATE_NO_CACHE = 0,
+ R5C_STATE_WRITE_THROUGH = 1,
+ R5C_STATE_WRITE_BACK = 2,
+ R5C_STATE_CACHE_BROKEN = 3,
+};
+
+/*
+ * raid5 cache state machine
+ *
+ * The RAID cache works in two major states for each stripe: write state and
+ * reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN and
+ * STRIPE_R5C_WRITTEN
+ *
+ * STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
+ * state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
+ * state runs w/ STRIPE_R5C_FROZEN == 1.
+ *
+ * STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
+ * state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
+ * process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
+ * and parity of a stripe is all in journal device; and cleared when the data
+ * and parity are all in RAID disks.
+ *
+ * The following is another way to show how STRIPE_R5C_FROZEN and
+ * STRIPE_R5C_WRITTEN work:
+ *
+ * write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
+ * reclaim state: STRIPE_R5C_FROZEN = 1
+ *
+ * write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
+ * reclaim => write:
+ * 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
+ * 2. write data/parity to raid disks, when finished, clear both
+ * STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
+ *
+ * In write through mode (journal only) the stripe also goes through these
+ * state changes, except that STRIPE_R5C_FROZEN is set on write in
+ * r5c_handle_stripe_dirtying().
+ */
+
struct r5l_log {
struct md_rdev *rdev;
@@ -96,6 +137,9 @@ struct r5l_log {
spinlock_t no_space_stripes_lock;
bool need_cache_flush;
+
+ /* for r5c_cache */
+ enum r5c_state r5c_state;
};
/*
@@ -133,6 +177,11 @@ enum r5l_io_unit_state {
IO_UNIT_STRIPE_END = 3, /* stripes data finished writing to raid */
};
+bool r5c_is_writeback(struct r5l_log *log)
+{
+ return (log != NULL && log->r5c_state == R5C_STATE_WRITE_BACK);
+}
+
static sector_t r5l_ring_add(struct r5l_log *log, sector_t start, sector_t inc)
{
start += inc;
@@ -168,12 +217,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
io->state = state;
}
+/*
+ * Freeze the stripe, thus send the stripe into reclaim path.
+ *
+ * In current implementation, STRIPE_R5C_FROZEN is also set in write through
+ * mode (in r5c_handle_stripe_dirtying). This does not change the behavior of
+ * write through mode.
+ */
+void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
+{
+ struct r5conf *conf = sh->raid_conf;
+ struct r5l_log *log = conf->log;
+
+ if (!log)
+ return;
+ WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
+ set_bit(STRIPE_R5C_FROZEN, &sh->state);
+}
+
+static void r5c_finish_cache_stripe(struct stripe_head *sh)
+{
+ struct r5l_log *log = sh->raid_conf->log;
+
+ if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
+ BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+ set_bit(STRIPE_R5C_WRITTEN, &sh->state);
+ } else
+ BUG(); /* write back logic in next patch */
+}
+
static void r5l_io_run_stripes(struct r5l_io_unit *io)
{
struct stripe_head *sh, *next;
list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
list_del_init(&sh->log_list);
+
+ r5c_finish_cache_stripe(sh);
+
set_bit(STRIPE_HANDLE, &sh->state);
raid5_release_stripe(sh);
}
@@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
r5l_append_payload_page(log, sh->dev[i].page);
}
- if (sh->qd_idx >= 0) {
+ if (parity_pages == 2) {
r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
sh->sector, sh->dev[sh->pd_idx].log_checksum,
sh->dev[sh->qd_idx].log_checksum, true);
r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
- } else {
+ } else if (parity_pages == 1) {
r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
sh->sector, sh->dev[sh->pd_idx].log_checksum,
0, false);
r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
- }
+ } else
+ BUG_ON(parity_pages != 0);
list_add_tail(&sh->log_list, &io->stripe_list);
atomic_inc(&io->pending_stripe);
@@ -455,6 +537,8 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
return -EAGAIN;
}
+ WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+
for (i = 0; i < sh->disks; i++) {
void *addr;
@@ -1101,6 +1185,39 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
set_bit(MD_CHANGE_DEVS, &mddev->flags);
}
+int r5c_handle_stripe_dirtying(struct r5conf *conf,
+ struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
+{
+ struct r5l_log *log = conf->log;
+
+ if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
+ return -EAGAIN;
+
+ if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
+ conf->mddev->degraded != 0) {
+ /* write through mode */
+ r5c_freeze_stripe_for_reclaim(sh);
+ return -EAGAIN;
+ }
+ BUG(); /* write back logic in next commit */
+ return 0;
+}
+
+/*
+ * clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
+ * committed to RAID disks
+*/
+void r5c_handle_stripe_written(struct r5conf *conf,
+ struct stripe_head *sh)
+{
+ if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
+ return;
+ WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+ clear_bit(STRIPE_R5C_FROZEN, &sh->state);
+}
+
static int r5l_load_log(struct r5l_log *log)
{
struct md_rdev *rdev = log->rdev;
@@ -1236,6 +1353,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
INIT_LIST_HEAD(&log->no_space_stripes);
spin_lock_init(&log->no_space_stripes_lock);
+ log->r5c_state = R5C_STATE_WRITE_THROUGH;
+
if (r5l_load_log(log))
goto error;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 67d4f49..2e3e61a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3506,6 +3506,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
int rmw = 0, rcw = 0, i;
sector_t recovery_cp = conf->mddev->recovery_cp;
+ if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
+ return;
+
/* Check whether resync is now happening or should start.
* If yes, then the array is dirty (after unclean shutdown or
* initial creation), so parity in some stripes might be inconsistent.
@@ -4396,13 +4399,23 @@ static void handle_stripe(struct stripe_head *sh)
|| s.expanding)
handle_stripe_fill(sh, &s, disks);
- /* Now to consider new write requests and what else, if anything
- * should be read. We do not handle new writes when:
+ /*
+ * When the stripe finishes full journal write cycle (write to journal
+ * and raid disk), this is the clean up procedure so it is ready for
+ * next operation.
+ */
+ r5c_handle_stripe_written(conf, sh);
+
+ /*
+ * Now to consider new write requests, cache write back and what else,
+ * if anything should be read. We do not handle new writes when:
* 1/ A 'write' operation (copy+xor) is already in flight.
* 2/ A 'check' operation is in flight, as it may clobber the parity
* block.
+ * 3/ A r5c cache log write is in flight.
*/
- if (s.to_write && !sh->reconstruct_state && !sh->check_state)
+ if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
+ !sh->reconstruct_state && !sh->check_state && !sh->log_io)
handle_stripe_dirtying(conf, sh, &s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
@@ -5122,6 +5135,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
* data on failed drives.
*/
if (rw == READ && mddev->degraded == 0 &&
+ !r5c_is_writeback(conf->log) &&
mddev->reshape_position == MaxSector) {
bi = chunk_aligned_read(mddev, bi);
if (!bi)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 46cfe93..8bae64b 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -345,7 +345,9 @@ enum {
STRIPE_BITMAP_PENDING, /* Being added to bitmap, don't add
* to batch yet.
*/
- STRIPE_LOG_TRAPPED, /* trapped into log */
+ STRIPE_LOG_TRAPPED, /* trapped into log */
+ STRIPE_R5C_FROZEN, /* r5c_cache frozen and being written out */
+ STRIPE_R5C_WRITTEN, /* ready for r5c_handle_stripe_written() */
};
#define STRIPE_EXPAND_SYNC_FLAGS \
@@ -712,4 +714,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
extern void r5l_quiesce(struct r5l_log *log, int state);
extern bool r5l_log_disk_error(struct r5conf *conf);
+extern bool r5c_is_writeback(struct r5l_log *log);
+extern int
+r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s, int disks);
+extern void
+r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
#endif
--
2.9.3
^ permalink raw reply related
* [PATCH v5 4/8] md/r5cache: write part of r5cache
From: Song Liu @ 2016-10-13 5:49 UTC (permalink / raw)
To: linux-raid
Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-1-songliubraving@fb.com>
This is the write part of r5cache. The cache is integrated with
stripe cache of raid456. It leverages code of r5l_log to write
data to journal device.
r5cache splits the current write path into 2 parts: the write path
and the reclaim path. The write path is as follows:
1. write data to journal
(r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
(r5c_handle_data_cached, r5c_return_dev_pending_writes).
The reclaim path is then:
1. Freeze the stripe (r5c_freeze_stripe_for_reclaim)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to journal device
4. Write data and parity to RAID disks
Reclaim path of the cache is implemented in the next patch.
With r5cache, write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to journal
device vs. read and then write to raid disks). Also, r5cache will
reduce RAID overhead (multiple IOs due to read-modify-write of
parity) and provide more opportunities of full stripe writes.
This patch adds 2 flags to stripe_head.state:
- STRIPE_R5C_PARTIAL_STRIPE,
- STRIPE_R5C_FULL_STRIPE,
Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered as "active".
For RMW, the code allocates an extra page for each data block
being updated. This is stored in r5dev->page and the old data
is read into it. Then the prexor calculation subtracts ->page
from the parity block, and the reconstruct calculation adds the
->orig_page data back into the parity block.
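As a reminder of why the old data has to be kept around for RMW, here is the parity arithmetic in isolation. It is a userspace sketch for plain XOR (P) parity only, with made-up function names and byte buffers instead of pages; RAID6 Q parity needs Galois-field multiplication and is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR src into dst. For XOR parity, "add" and "subtract" are the same op. */
static void xor_into(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] ^= src[i];
}

/*
 * Read-modify-write parity update for one data block:
 * new parity = old parity ^ old data ^ new data.
 */
static void rmw_update_parity(uint8_t *parity, const uint8_t *old_data,
			      const uint8_t *new_data, size_t len)
{
	assert(parity && old_data && new_data);
	xor_into(parity, old_data, len);	/* prexor: take old data out */
	xor_into(parity, new_data, len);	/* reconstruct: put new data in */
}
```

The result equals a full recomputation over all data blocks, but only the block being updated and the parity block have to be read.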
r5cache naturally excludes SkipCopy. With R5_Wantcache bit set,
async_copy_data will not skip copy.
There are some known limitations of the cache implementation:
1. Write cache only covers full page writes (R5_OVERWRITE). Writes
of smaller granularity are write through.
2. Only one log io (sh->log_io) for each stripe at any time. Later
writes for the same stripe have to wait. This can be improved by
moving log_io to r5dev.
3. With writeback cache, read path must enter state machine, which
is a significant bottleneck for some workloads.
4. There is no per stripe checkpoint (with r5l_payload_flush) in
the log, so recovery code has to replay more data than necessary
(sometimes all the log from last_checkpoint). This reduces
availability of the array.
This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
drivers/md/raid5-cache.c | 204 +++++++++++++++++++++++++++++++++++++++++++++--
drivers/md/raid5.c | 144 ++++++++++++++++++++++++++++-----
drivers/md/raid5.h | 22 +++++
3 files changed, 344 insertions(+), 26 deletions(-)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9e05850..92d3d7b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -20,6 +20,7 @@
#include <linux/random.h>
#include "md.h"
#include "raid5.h"
+#include "bitmap.h"
/*
* metadata/data stored in disk with 4k size unit (a block) regardless
@@ -217,6 +218,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
io->state = state;
}
+static void
+r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
+ struct bio_list *return_bi)
+{
+ struct bio *wbi, *wbi2;
+
+ wbi = dev->written;
+ dev->written = NULL;
+ while (wbi && wbi->bi_iter.bi_sector <
+ dev->sector + STRIPE_SECTORS) {
+ wbi2 = r5_next_bio(wbi, dev->sector);
+ if (!raid5_dec_bi_active_stripes(wbi)) {
+ md_write_end(conf->mddev);
+ bio_list_add(return_bi, wbi);
+ }
+ wbi = wbi2;
+ }
+}
+
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+ struct stripe_head *sh, int disks, struct bio_list *return_bi)
+{
+ int i;
+
+ for (i = sh->disks; i--; ) {
+ if (test_bit(R5_InCache, &sh->dev[i].flags) &&
+ sh->dev[i].written) {
+ set_bit(R5_UPTODATE, &sh->dev[i].flags);
+ r5c_return_dev_pending_writes(conf, &sh->dev[i],
+ return_bi);
+ bitmap_endwrite(conf->mddev->bitmap, sh->sector,
+ STRIPE_SECTORS,
+ !test_bit(STRIPE_DEGRADED, &sh->state),
+ 0);
+ }
+ }
+}
+
/*
* Freeze the stripe, thus send the stripe into reclaim path.
*
@@ -233,6 +272,48 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
return;
WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
set_bit(STRIPE_R5C_FROZEN, &sh->state);
+
+ if (log->r5c_state == R5C_STATE_WRITE_THROUGH)
+ return;
+
+ if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ atomic_inc(&conf->preread_active_stripes);
+
+ if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+ BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+ atomic_dec(&conf->r5c_cached_partial_stripes);
+ }
+
+ if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+ BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+ atomic_dec(&conf->r5c_cached_full_stripes);
+ }
+}
+
+static void r5c_handle_data_cached(struct stripe_head *sh)
+{
+ int i;
+
+ for (i = sh->disks; i--; )
+ if (test_and_clear_bit(R5_Wantcache, &sh->dev[i].flags)) {
+ set_bit(R5_InCache, &sh->dev[i].flags);
+ clear_bit(R5_LOCKED, &sh->dev[i].flags);
+ atomic_inc(&sh->dev_in_cache);
+ }
+}
+
+/*
+ * this journal write must contain full parity,
+ * it may also contain some data pages
+ */
+static void r5c_handle_parity_cached(struct stripe_head *sh)
+{
+ int i;
+
+ for (i = sh->disks; i--; )
+ if (test_bit(R5_InCache, &sh->dev[i].flags))
+ set_bit(R5_Wantwrite, &sh->dev[i].flags);
+ set_bit(STRIPE_R5C_WRITTEN, &sh->state);
}
static void r5c_finish_cache_stripe(struct stripe_head *sh)
@@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
set_bit(STRIPE_R5C_WRITTEN, &sh->state);
- } else
- BUG(); /* write back logic in next patch */
+ } else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
+ r5c_handle_parity_cached(sh);
+ else
+ r5c_handle_data_cached(sh);
}
static void r5l_io_run_stripes(struct r5l_io_unit *io)
@@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
io = log->current_io;
for (i = 0; i < sh->disks; i++) {
- if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+ if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
+ !test_bit(R5_Wantcache, &sh->dev[i].flags))
continue;
if (i == sh->pd_idx || i == sh->qd_idx)
continue;
@@ -514,7 +598,6 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
return 0;
}
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
/*
* running in raid5d, where reclaim could wait for raid5d too (when it flushes
* data from log to raid disks), so we shouldn't wait for reclaim here
@@ -544,6 +627,10 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
continue;
+
+ if (test_bit(R5_InCache, &sh->dev[i].flags))
+ continue;
+
write_disks++;
/* checksum is already calculated in last run */
if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
@@ -809,7 +896,6 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
}
}
-
static void r5l_do_reclaim(struct r5l_log *log)
{
sector_t reclaim_target = xchg(&log->reclaim_target, 0);
@@ -872,7 +958,7 @@ static void r5l_reclaim_thread(struct md_thread *thread)
r5l_do_reclaim(log);
}
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
{
unsigned long target;
unsigned long new = (unsigned long)space; /* overflow in theory */
@@ -1191,6 +1277,8 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
int disks)
{
struct r5l_log *log = conf->log;
+ int i;
+ struct r5dev *dev;
if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
return -EAGAIN;
@@ -1201,21 +1289,121 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
r5c_freeze_stripe_for_reclaim(sh);
return -EAGAIN;
}
- BUG(); /* write back logic in next commit */
+
+ s->to_cache = 0;
+
+ for (i = disks; i--; ) {
+ dev = &sh->dev[i];
+ /* if non-overwrite, use the reclaim path (write through) */
+ if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+ !test_bit(R5_InCache, &dev->flags)) {
+ r5c_freeze_stripe_for_reclaim(sh);
+ return -EAGAIN;
+ }
+ }
+
+ for (i = disks; i--; ) {
+ dev = &sh->dev[i];
+ if (dev->towrite) {
+ set_bit(R5_Wantcache, &dev->flags);
+ set_bit(R5_Wantdrain, &dev->flags);
+ set_bit(R5_LOCKED, &dev->flags);
+ s->to_cache++;
+ }
+ }
+
+ if (s->to_cache)
+ set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
+
return 0;
}
/*
* clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
* committed to RAID disks
-*/
+ */
void r5c_handle_stripe_written(struct r5conf *conf,
struct stripe_head *sh)
{
+ int i;
+ int do_wakeup = 0;
+
if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
return;
WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
clear_bit(STRIPE_R5C_FROZEN, &sh->state);
+
+ if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH)
+ return;
+
+ for (i = sh->disks; i--; ) {
+ if (test_and_clear_bit(R5_InCache, &sh->dev[i].flags))
+ atomic_dec(&sh->dev_in_cache);
+ if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+ do_wakeup = 1;
+ }
+
+ if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
+ if (atomic_dec_and_test(&conf->pending_full_writes))
+ md_wakeup_thread(conf->mddev->thread);
+
+ if (do_wakeup)
+ wake_up(&conf->wait_for_overlap);
+}
+
+int
+r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+ struct stripe_head_state *s)
+{
+ int pages;
+ int reserve;
+ int i;
+ int ret = 0;
+ int page_count = 0;
+
+ BUG_ON(!log);
+
+ for (i = 0; i < sh->disks; i++) {
+ void *addr;
+
+ if (!test_bit(R5_Wantcache, &sh->dev[i].flags))
+ continue;
+ addr = kmap_atomic(sh->dev[i].page);
+ sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum,
+ addr, PAGE_SIZE);
+ kunmap_atomic(addr);
+ page_count++;
+ }
+ WARN_ON(page_count != s->to_cache);
+ pages = s->to_cache;
+
+ /*
+ * The stripe must enter state machine again to call endio, so
+ * don't delay.
+ */
+ clear_bit(STRIPE_DELAYED, &sh->state);
+ atomic_inc(&sh->count);
+
+ mutex_lock(&log->io_mutex);
+ /* meta + data */
+ reserve = (1 + pages) << (PAGE_SHIFT - 9);
+ if (!r5l_has_free_space(log, reserve)) {
+ spin_lock(&log->no_space_stripes_lock);
+ list_add_tail(&sh->log_list, &log->no_space_stripes);
+ spin_unlock(&log->no_space_stripes_lock);
+
+ r5l_wake_reclaim(log, reserve);
+ } else {
+ ret = r5l_log_stripe(log, sh, pages, 0);
+ if (ret) {
+ spin_lock_irq(&log->io_list_lock);
+ list_add_tail(&sh->log_list, &log->no_mem_stripes);
+ spin_unlock_irq(&log->io_list_lock);
+ }
+ }
+
+ mutex_unlock(&log->io_mutex);
+ return 0;
}
static int r5l_load_log(struct r5l_log *log)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2e3e61a..0539f34 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
< IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
atomic_dec(&conf->active_stripes);
- if (!test_bit(STRIPE_EXPANDING, &sh->state))
- list_add_tail(&sh->lru, temp_inactive_list);
+ if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+ if (atomic_read(&sh->dev_in_cache) == 0) {
+ list_add_tail(&sh->lru, temp_inactive_list);
+ } else if (atomic_read(&sh->dev_in_cache) ==
+ conf->raid_disks - conf->max_degraded) {
+ /* full stripe */
+ if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
+ atomic_inc(&conf->r5c_cached_full_stripes);
+ if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
+ atomic_dec(&conf->r5c_cached_partial_stripes);
+ list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
+ } else {
+ /* partial stripe */
+ if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
+ &sh->state))
+ atomic_inc(&conf->r5c_cached_partial_stripes);
+ list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
+ }
+ }
}
}
@@ -830,6 +847,11 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
might_sleep();
+ if (s->to_cache) {
+ r5c_cache_data(conf->log, sh, s);
+ return;
+ }
+
if (r5l_write_stripe(conf->log, sh) == 0)
return;
for (i = disks; i--; ) {
@@ -1044,7 +1066,7 @@ again:
static struct dma_async_tx_descriptor *
async_copy_data(int frombio, struct bio *bio, struct page **page,
sector_t sector, struct dma_async_tx_descriptor *tx,
- struct stripe_head *sh)
+ struct stripe_head *sh, int no_skipcopy)
{
struct bio_vec bvl;
struct bvec_iter iter;
@@ -1084,7 +1106,8 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
if (frombio) {
if (sh->raid_conf->skip_copy &&
b_offset == 0 && page_offset == 0 &&
- clen == STRIPE_SIZE)
+ clen == STRIPE_SIZE &&
+ !no_skipcopy)
*page = bio_page;
else
tx = async_memcpy(*page, bio_page, page_offset,
@@ -1166,7 +1189,7 @@ static void ops_run_biofill(struct stripe_head *sh)
while (rbi && rbi->bi_iter.bi_sector <
dev->sector + STRIPE_SECTORS) {
tx = async_copy_data(0, rbi, &dev->page,
- dev->sector, tx, sh);
+ dev->sector, tx, sh, 0);
rbi = r5_next_bio(rbi, dev->sector);
}
}
@@ -1293,7 +1316,8 @@ static int set_syndrome_sources(struct page **srcs,
if (i == sh->qd_idx || i == sh->pd_idx ||
(srctype == SYNDROME_SRC_ALL) ||
(srctype == SYNDROME_SRC_WANT_DRAIN &&
- test_bit(R5_Wantdrain, &dev->flags)) ||
+ (test_bit(R5_Wantdrain, &dev->flags) ||
+ test_bit(R5_InCache, &dev->flags))) ||
(srctype == SYNDROME_SRC_WRITTEN &&
dev->written))
srcs[slot] = sh->dev[i].page;
@@ -1472,9 +1496,25 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
static void ops_complete_prexor(void *stripe_head_ref)
{
struct stripe_head *sh = stripe_head_ref;
+ int i;
pr_debug("%s: stripe %llu\n", __func__,
(unsigned long long)sh->sector);
+
+ if (!r5c_is_writeback(sh->raid_conf->log))
+ return;
+
+ /*
+ * raid5-cache write back uses orig_page during prexor. after prexor,
+ * it is time to free orig_page
+ */
+ for (i = sh->disks; i--; )
+ if (sh->dev[i].page != sh->dev[i].orig_page) {
+ struct page *p = sh->dev[i].page;
+
+ sh->dev[i].page = sh->dev[i].orig_page;
+ put_page(p);
+ }
}
static struct dma_async_tx_descriptor *
@@ -1496,7 +1536,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
/* Only process blocks that are known to be uptodate */
- if (test_bit(R5_Wantdrain, &dev->flags))
+ if (test_bit(R5_Wantdrain, &dev->flags) ||
+ test_bit(R5_InCache, &dev->flags))
xor_srcs[count++] = dev->page;
}
@@ -1547,6 +1588,10 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
again:
dev = &sh->dev[i];
+ if (test_and_clear_bit(R5_InCache, &dev->flags)) {
+ BUG_ON(atomic_read(&sh->dev_in_cache) == 0);
+ atomic_dec(&sh->dev_in_cache);
+ }
spin_lock_irq(&sh->stripe_lock);
chosen = dev->towrite;
dev->towrite = NULL;
@@ -1566,8 +1611,10 @@ again:
set_bit(R5_Discard, &dev->flags);
else {
tx = async_copy_data(1, wbi, &dev->page,
- dev->sector, tx, sh);
- if (dev->page != dev->orig_page) {
+ dev->sector, tx, sh,
+ test_bit(R5_Wantcache, &dev->flags));
+ if (dev->page != dev->orig_page &&
+ !test_bit(R5_Wantcache, &dev->flags)) {
set_bit(R5_SkipCopy, &dev->flags);
clear_bit(R5_UPTODATE, &dev->flags);
clear_bit(R5_OVERWRITE, &dev->flags);
@@ -1675,7 +1722,8 @@ again:
xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
- if (head_sh->dev[i].written)
+ if (head_sh->dev[i].written ||
+ test_bit(R5_InCache, &head_sh->dev[i].flags))
xor_srcs[count++] = dev->page;
}
} else {
@@ -1930,6 +1978,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
INIT_LIST_HEAD(&sh->batch_list);
INIT_LIST_HEAD(&sh->lru);
atomic_set(&sh->count, 1);
+ atomic_set(&sh->dev_in_cache, 0);
for (i = 0; i < disks; i++) {
struct r5dev *dev = &sh->dev[i];
@@ -2810,12 +2859,30 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
+ /*
+ * Initially, handle_stripe_dirtying decided to run rmw
+ * and allocates extra page for prexor. However, rcw is
+ * cheaper later on. We need to free the extra page
+ * now, because we won't be able to do that in
+ * ops_complete_prexor().
+ */
+ if (sh->dev[i].page != sh->dev[i].orig_page) {
+ struct page *p = sh->dev[i].page;
+
+ p = sh->dev[i].page;
+ sh->dev[i].page = sh->dev[i].orig_page;
+ put_page(p);
+ }
+
if (dev->towrite) {
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantdrain, &dev->flags);
if (!expand)
clear_bit(R5_UPTODATE, &dev->flags);
s->locked++;
+ } else if (test_bit(R5_InCache, &dev->flags)) {
+ set_bit(R5_LOCKED, &dev->flags);
+ s->locked++;
}
}
/* if we are not expanding this is a proper write request, and
@@ -2855,6 +2922,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
set_bit(R5_LOCKED, &dev->flags);
clear_bit(R5_UPTODATE, &dev->flags);
s->locked++;
+ } else if (test_bit(R5_InCache, &dev->flags)) {
+ set_bit(R5_LOCKED, &dev->flags);
+ s->locked++;
}
}
if (!s->locked)
@@ -3529,9 +3599,12 @@ static void handle_stripe_dirtying(struct r5conf *conf,
} else for (i = disks; i--; ) {
/* would I have to read this buffer for read_modify_write */
struct r5dev *dev = &sh->dev[i];
- if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+ if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
+ test_bit(R5_InCache, &dev->flags)) &&
!test_bit(R5_LOCKED, &dev->flags) &&
- !(test_bit(R5_UPTODATE, &dev->flags) ||
+ !((test_bit(R5_UPTODATE, &dev->flags) &&
+ (!test_bit(R5_InCache, &dev->flags) ||
+ dev->page != dev->orig_page)) ||
test_bit(R5_Wantcompute, &dev->flags))) {
if (test_bit(R5_Insync, &dev->flags))
rmw++;
@@ -3543,13 +3616,15 @@ static void handle_stripe_dirtying(struct r5conf *conf,
i != sh->pd_idx && i != sh->qd_idx &&
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_Wantcompute, &dev->flags))) {
+ test_bit(R5_InCache, &dev->flags) ||
+ test_bit(R5_Wantcompute, &dev->flags))) {
if (test_bit(R5_Insync, &dev->flags))
rcw++;
else
rcw += 2*disks;
}
}
+
pr_debug("for sector %llu, rmw=%d rcw=%d\n",
(unsigned long long)sh->sector, rmw, rcw);
set_bit(STRIPE_HANDLE, &sh->state);
@@ -3561,10 +3636,18 @@ static void handle_stripe_dirtying(struct r5conf *conf,
(unsigned long long)sh->sector, rmw);
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
- if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+ if (test_bit(R5_InCache, &dev->flags) &&
+ dev->page == dev->orig_page)
+ dev->page = alloc_page(GFP_NOIO); /* prexor */
+
+ if ((dev->towrite ||
+ i == sh->pd_idx || i == sh->qd_idx ||
+ test_bit(R5_InCache, &dev->flags)) &&
!test_bit(R5_LOCKED, &dev->flags) &&
- !(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_Wantcompute, &dev->flags)) &&
+ !((test_bit(R5_UPTODATE, &dev->flags) &&
+ (!test_bit(R5_InCache, &dev->flags) ||
+ dev->page != dev->orig_page)) ||
+ test_bit(R5_Wantcompute, &dev->flags)) &&
test_bit(R5_Insync, &dev->flags)) {
if (test_bit(STRIPE_PREREAD_ACTIVE,
&sh->state)) {
@@ -3590,6 +3673,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
i != sh->pd_idx && i != sh->qd_idx &&
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
+ test_bit(R5_InCache, &dev->flags) ||
test_bit(R5_Wantcompute, &dev->flags))) {
rcw++;
if (test_bit(R5_Insync, &dev->flags) &&
@@ -3629,7 +3713,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
*/
if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
(s->locked == 0 && (rcw == 0 || rmw == 0) &&
- !test_bit(STRIPE_BIT_DELAY, &sh->state)))
+ !test_bit(STRIPE_BIT_DELAY, &sh->state)))
schedule_reconstruction(sh, s, rcw == 0, 0);
}
@@ -4120,6 +4204,10 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
if (rdev && !test_bit(Faulty, &rdev->flags))
do_recovery = 1;
}
+ if (test_bit(R5_InCache, &dev->flags) && dev->written)
+ s->just_cached++;
+ if (test_bit(R5_Wantcache, &dev->flags) && dev->written)
+ s->want_cache++;
}
if (test_bit(STRIPE_SYNCING, &sh->state)) {
/* If there is a failed device being replaced,
@@ -4285,6 +4373,17 @@ static void handle_stripe(struct stripe_head *sh)
analyse_stripe(sh, &s);
+ if (s.want_cache) {
+ /* In last run of handle_stripe, we have finished
+ * r5c_handle_stripe_dirtying and ops_run_biodrain, but
+ * r5c_cache_data didn't finish because the journal device
+ * didn't have enough space. This time we should continue
+ * r5c_cache_data
+ */
+ s.to_cache = s.want_cache;
+ goto finish;
+ }
+
if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
goto finish;
@@ -4348,7 +4447,7 @@ static void handle_stripe(struct stripe_head *sh)
struct r5dev *dev = &sh->dev[i];
if (test_bit(R5_LOCKED, &dev->flags) &&
(i == sh->pd_idx || i == sh->qd_idx ||
- dev->written)) {
+ dev->written || test_bit(R5_InCache, &dev->flags))) {
pr_debug("Writing block %d\n", i);
set_bit(R5_Wantwrite, &dev->flags);
if (prexor)
@@ -4388,6 +4487,10 @@ static void handle_stripe(struct stripe_head *sh)
test_bit(R5_Discard, &qdev->flags))))))
handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+ if (s.just_cached)
+ r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
+ r5l_stripe_write_finished(sh);
+
/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
* or to load a block that is being partially written.
@@ -6526,6 +6629,11 @@ static struct r5conf *setup_conf(struct mddev *mddev)
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
INIT_LIST_HEAD(conf->temp_inactive_list + i);
+ atomic_set(&conf->r5c_cached_full_stripes, 0);
+ INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
+ atomic_set(&conf->r5c_cached_partial_stripes, 0);
+ INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
+
conf->level = mddev->new_level;
conf->chunk_sectors = mddev->new_chunk_sectors;
if (raid5_alloc_percpu(conf) != 0)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8bae64b..ac6d7c7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -226,6 +226,7 @@ struct stripe_head {
struct r5l_io_unit *log_io;
struct list_head log_list;
+ atomic_t dev_in_cache;
/**
* struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
@@ -263,6 +264,7 @@ struct stripe_head_state {
*/
int syncing, expanding, expanded, replacing;
int locked, uptodate, to_read, to_write, failed, written;
+ int to_cache, want_cache, just_cached;
int to_fill, compute, req_compute, non_overwrite;
int failed_num[2];
int p_failed, q_failed;
@@ -313,6 +315,8 @@ enum r5dev_flags {
*/
R5_Discard, /* Discard the stripe */
R5_SkipCopy, /* Don't copy data from bio to stripe cache */
+ R5_Wantcache, /* Want write data to write cache */
+ R5_InCache, /* Data in cache */
};
/*
@@ -348,6 +352,12 @@ enum {
STRIPE_LOG_TRAPPED, /* trapped into log */
STRIPE_R5C_FROZEN, /* r5c_cache frozen and being written out */
STRIPE_R5C_WRITTEN, /* ready for r5c_handle_stripe_written() */
+ STRIPE_R5C_PARTIAL_STRIPE, /* in r5c cache (to-be/being handled or
+ * in conf->r5c_partial_stripe_list)
+ */
+ STRIPE_R5C_FULL_STRIPE, /* in r5c cache (to-be/being handled or
+ * in conf->r5c_full_stripe_list)
+ */
};
#define STRIPE_EXPAND_SYNC_FLAGS \
@@ -600,6 +610,12 @@ struct r5conf {
*/
atomic_t active_stripes;
struct list_head inactive_list[NR_STRIPE_HASH_LOCKS];
+
+ atomic_t r5c_cached_full_stripes;
+ struct list_head r5c_full_stripe_list;
+ atomic_t r5c_cached_partial_stripes;
+ struct list_head r5c_partial_stripe_list;
+
atomic_t empty_inactive_list_nr;
struct llist_head released_stripes;
wait_queue_head_t wait_for_quiescent;
@@ -720,4 +736,10 @@ r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks);
extern void
r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
+extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+extern void r5c_handle_cached_data_endio(struct r5conf *conf,
+ struct stripe_head *sh, int disks, struct bio_list *return_bi);
+extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+ struct stripe_head_state *s);
+extern void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh);
#endif
--
2.9.3
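
Note on the release path: the do_release_stripe() hunk above decides
where a released stripe goes by comparing sh->dev_in_cache with the
number of data disks (conf->raid_disks - conf->max_degraded). The
following is a minimal standalone sketch of that decision only. The enum
and function names are hypothetical, and the counter and list updates
done by the real code are left out.

```c
#include <assert.h>

/* Hypothetical names for illustration; these are not kernel identifiers. */
enum release_target {
	RELEASE_INACTIVE,	/* no cached data: goes to the inactive list */
	RELEASE_FULL_STRIPE,	/* every data device has data in the cache */
	RELEASE_PARTIAL_STRIPE,	/* only some data devices are cached */
};

static enum release_target
classify_released_stripe(int dev_in_cache, int raid_disks, int max_degraded)
{
	/* Parity devices never count as cached data. */
	int data_disks = raid_disks - max_degraded;

	assert(dev_in_cache >= 0 && dev_in_cache <= data_disks);

	if (dev_in_cache == 0)
		return RELEASE_INACTIVE;
	if (dev_in_cache == data_disks)
		return RELEASE_FULL_STRIPE;
	return RELEASE_PARTIAL_STRIPE;
}
```

For a 4-disk RAID5 (max_degraded of 1), a stripe with all three data
devices in the cache is a full stripe, and one with one or two cached
devices is a partial stripe.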