* Re: Level change from 4 disk RAID5 to 4 disk RAID6
From: NeilBrown @ 2017-04-10 1:04 UTC (permalink / raw)
To: LM, linux-raid
In-Reply-To: <20170408214239.GF10267@lars-laptop>
On Sat, Apr 08 2017, LM wrote:
> Hi,
>
> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
> replace the 4 disks by 4 disks with a size of 2TB each.
>
> As far as I understand the man page, this can be achieved by replacing
> the devices one after another and for each device rebuild the degraded
> array with:
>
> mdadm /dev/md0 --add /dev/sdX1
>
> Then the level change can be done together with growing the array:
>
> mdadm --grow /dev/md0 --level=raid6 --backup-file=/root/backup-md0
>
> Does this work?
>
> I am asking if it works, because the man page also says:
>
>> mdadm --grow /dev/md4 --level=6 --backup-file=/root/backup-md4
>> The array /dev/md4 which is currently a RAID5 array will
>> be converted to RAID6. There should normally already be
>> a spare drive attached to the array as a RAID6 needs one
>> more drive than a matching RAID5.
>
> And in my case only the size of disks is increased, not their number.
>
Yes, it probably works, and you probably don't need a backup file.
You might, however, need to tell mdadm explicitly to keep the number of
devices unchanged by specifying "--raid-devices=4".
You probably aren't very encouraged that I say "probably" and "might",
and this is deliberate.
I recommend that you create four 10MB files, use losetup to turn them
into 10MB loop devices, and build a RAID5 over them with --size=5M.
Then try the --grow --level=6 command, and see what happens.
If you mess up, you can easily start from scratch and try again.
If it works, you can have some confidence that the same process will
have the same result on real devices.
NeilBrown
* (unknown),
From: hp @ 2017-04-10 3:30 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: 7718637436266_linux-raid.zip --]
[-- Type: application/zip, Size: 3603 bytes --]
* [PATCH] mdadm.c:fix compile warning "mdfd is uninitialized"
From: Zhilong Liu @ 2017-04-10 4:49 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <016b495a-1361-0f29-0fcb-6af008625565@gmail.com>
Initialize mdfd to -1 to prevent the "mdfd is uninitialized" compile
warning reported by some compilers, for example gcc version 4.8.5
(SUSE Linux).
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
mdadm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mdadm.c b/mdadm.c
index 001ff68..41dae1d 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1916,7 +1916,7 @@ static int misc_list(struct mddev_dev *devlist,
int rv = 0;
for (dv = devlist; dv; dv = (rv & 16) ? NULL : dv->next) {
- int mdfd;
+ int mdfd = -1;
switch(dv->disposition) {
case 'D':
--
2.6.6
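
A minimal sketch of the pattern this patch addresses, assuming the
descriptor is only assigned in some switch branches while the cleanup
runs for all of them. handle_one() and its arguments are hypothetical
names for illustration, not code from mdadm.c:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for one iteration of a misc_list()-style loop:
 * only some dispositions open a descriptor, but the cleanup at the end
 * runs on every path.
 */
static int handle_one(char disposition, const char *path)
{
	int fd = -1;	/* sentinel meaning "nothing opened yet" */
	int rv = 0;

	assert(path != NULL);

	switch (disposition) {
	case 'D':	/* this branch needs the device open */
		fd = open(path, O_RDONLY);
		if (fd < 0)
			rv = 1;
		break;
	default:	/* other branches never touch fd */
		break;
	}

	/* Well defined on every path because fd always holds a value. */
	if (fd >= 0)
		close(fd);
	return rv;
}
```

Without the initializer, a compiler that cannot prove the default
branch leaves fd unused may warn that it is used uninitialized.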
* Re: Level change from 4 disk RAID5 to 4 disk RAID6
From: Wols Lists @ 2017-04-10 5:41 UTC (permalink / raw)
To: LM, linux-raid
In-Reply-To: <20170408214239.GF10267@lars-laptop>
On 08/04/17 22:42, LM wrote:
> Hi,
>
> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
> replace the 4 disks by 4 disks with a size of 2TB each.
>
> As far as I understand the man page, this can be achieved by replacing
> the devices one after another and for each device rebuild the degraded
> array with:
>
> mdadm /dev/md0 --add /dev/sdX1
Do you have a spare SATA port (or whatever interface your drives use)?
If so, use mdadm's --replace option rather than failing a drive and then
adding the new one. Otherwise you're risking a drive failure taking out
your array while it is degraded - not a good move.
And if you don't have a spare port, $20 for a PCI card or whatever is a
good investment to keep your data safe.
Have a look at the raid wiki - it tries to be a bit more verbose and
easily comprehensible than the man page.
Cheers,
Wol
* [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Zhilong Liu @ 2017-04-10 6:15 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Zhilong Liu
md.c: array_size_store() must release the mddev lock before it
returns -EINVAL for a clustered array.
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
drivers/md/md.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f6ae1d6..5327236 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
return err;
/* cluster raid doesn't support change array_sectors */
- if (mddev_is_clustered(mddev))
+ if (mddev_is_clustered(mddev)) {
+ mddev_unlock(mddev);
return -EINVAL;
+ }
if (strncmp(buf, "default", 7) == 0) {
if (mddev->pers)
--
2.6.6
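
A minimal sketch of the rule the patch enforces: every return after a
successful lock must be preceded by an unlock. The toy_* names and the
struct are hypothetical stand-ins, not the md.c types or locking API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature of an mddev with a simple "held" flag as lock. */
struct toy_dev {
	int locked;
	int clustered;
	long long array_sectors;
};

static int toy_lock(struct toy_dev *d)
{
	if (d->locked)
		return -EBUSY;
	d->locked = 1;
	return 0;
}

static void toy_unlock(struct toy_dev *d)
{
	d->locked = 0;
}

/* Mirrors the shape of the fixed function: early error path unlocks too. */
static int toy_size_store(struct toy_dev *d, long long sectors)
{
	int err;

	assert(d != NULL);

	err = toy_lock(d);
	if (err)
		return err;	/* lock was never taken, nothing to release */

	if (d->clustered) {
		toy_unlock(d);	/* the step the original code missed */
		return -EINVAL;
	}

	d->array_sectors = sectors;
	toy_unlock(d);
	return 0;
}
```

If the unlock in the clustered branch is dropped, the next caller of
toy_lock() gets -EBUSY forever, which is the toy equivalent of the
leaked mddev lock.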
* Re: [v2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome
From: Michael Ellerman @ 2017-04-10 6:54 UTC (permalink / raw)
To: Daniel Axtens, Matt Brown, linuxppc-dev; +Cc: linux-raid
In-Reply-To: <87r316hxmi.fsf@possimpible.ozlabs.ibm.com>
Daniel Axtens <dja@axtens.net> writes:
> Hi Matt,
>
> Thanks for answering my questions and doing those fixes.
>
>
>> Bugs fixed:
>> - A small bug in pq.h regarding a missing and mismatched
>> ifdef statement
>> - Fixed test/Makefile to correctly build test on ppc
>>
>
> I think this commit should be labelled:
> Fixes: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")
>
> mpe can probably add that when he merges - no need to do a new version :)
Please send a separate patch which does that fix.
>> else
>> - HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
>> - gcc -c -x c - >&/dev/null && \
>> - rm ./-.o && echo yes)
>> - ifeq ($(HAS_ALTIVEC),yes)
>> - OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
>> + HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
>> + gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
>> + ifeq ($(HAS_ALTIVEC),yes)
>> + CFLAGS += -I../../../arch/powerpc/include
>> + CFLAGS += -DCONFIG_ALTIVEC
>> + OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
>> + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>> endif
>> endif
> Looks like vim has replaced spaces with tabs here. Not sure how much we
> care...
We care at least because it makes the diff look bigger than it really
is. If I'm reading it right, the first three lines haven't actually
changed.
>> @@ -97,6 +99,18 @@ altivec4.c: altivec.uc ../unroll.awk
>> altivec8.c: altivec.uc ../unroll.awk
>> $(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
>>
> ... especially seeing as tabs are already used in the file here!
It's a Makefile! Tabs have meaning :)
>> +# ifdef __KERNEL__
>> + return (cpu_has_feature(CONFIG_ALTIVEC) &&
>> + cpu_has_feature(CPU_FTR_ARCH_207S));
> I think CPU_FTR_ARCH_207S implies Altivec? Again, not a real problem,
It doesn't.
And also CONFIG_ALTIVEC is not a cpu feature!
You should be using CPU_FTR_ALTIVEC_COMP. That copes with the case where
the kernel is compiled without ALTIVEC support.
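
A rough sketch of why a _COMP mask copes with a kernel built without
Altivec, assuming the mask is defined as the feature bit when the
config option is on and as 0 otherwise. All TOY_* names, the bit
values and toy_has_feature() are hypothetical; this is not a copy of
the powerpc headers, and the real cpu_has_feature() is more involved:

```c
/* Hypothetical feature bits; the real values live in the kernel headers. */
#define TOY_FTR_ALTIVEC		0x1UL
#define TOY_FTR_ARCH_207S	0x2UL

/*
 * Assumed pattern: the _COMP mask equals the feature bit only when the
 * matching config option is enabled, and is 0 otherwise.
 * Build with -DTOY_CONFIG_ALTIVEC to model a kernel with Altivec support.
 */
#ifdef TOY_CONFIG_ALTIVEC
#define TOY_FTR_ALTIVEC_COMP	TOY_FTR_ALTIVEC
#else
#define TOY_FTR_ALTIVEC_COMP	0UL
#endif

/* Toy feature test: true only for a non-empty mask whose bits are all set. */
static int toy_has_feature(unsigned long cpu_features, unsigned long mask)
{
	return mask != 0 && (cpu_features & mask) == mask;
}

/* The combined check: Altivec (if compiled in) and the 2.07 bit. */
static int toy_vpermxor_usable(unsigned long cpu_features)
{
	return toy_has_feature(cpu_features, TOY_FTR_ALTIVEC_COMP) &&
	       toy_has_feature(cpu_features, TOY_FTR_ARCH_207S);
}
```

With the config option off the mask is 0, so the check is false even
on hardware that has the unit, and no config symbol is ever passed as
a cpu feature.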
cheers
* Re: [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Guoqing Jiang @ 2017-04-10 7:23 UTC (permalink / raw)
To: Zhilong Liu, shli; +Cc: linux-raid
In-Reply-To: <1491804955-7548-1-git-send-email-zlliu@suse.com>
On 04/10/2017 02:15 PM, Zhilong Liu wrote:
> md.c: it needs to release the mddev lock before
> the array_size_store() returns.
Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported")
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
> drivers/md/md.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..5327236 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
> return err;
>
> /* cluster raid doesn't support change array_sectors */
> - if (mddev_is_clustered(mddev))
> + if (mddev_is_clustered(mddev)) {
> + mddev_unlock(mddev);
> return -EINVAL;
> + }
Reviewed-by: Guoqing Jiang <gqjiang@suse.com>
Thanks,
Guoqing
* [PATCH 0/2] mdadm/manpage: update manpage for readonly and array-size
From: Zhilong Liu @ 2017-04-10 8:01 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
Hi, Jes;
These two patches update the man page descriptions for readonly and
array-size.
Thanks,
Zhilong
---
Zhilong Liu (2):
mdadm/manpage:update description for readonly in manpage
mdadm/manpage:clustered array doesn't support --array-size yet
mdadm.8.in | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
--
2.6.6
* [PATCH 1/2] mdadm/manpage:update description for readonly in manpage
From: Zhilong Liu @ 2017-04-10 8:04 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <1491811296-9118-1-git-send-email-zlliu@suse.com>
update readonly/readwrite in man page:
Both readwrite and readonly currently work well, so update the
descriptions for them.
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
mdadm.8.in | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mdadm.8.in b/mdadm.8.in
index 744c12b..f006cf5 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -925,7 +925,8 @@ will not try to be so clever.
Start the array
.B read only
rather than read-write as normal. No writes will be allowed to the
-array, and no resync, recovery, or reshape will be started.
+array, and no resync, recovery, or reshape will be started. It works with
+Create, Assemble, Manage and Misc mode.
.TP
.BR \-a ", " "\-\-auto{=yes,md,mdp,part,p}{NN}"
@@ -2232,7 +2233,7 @@ be in use.
.TP
.B \-\-readonly
-start the array readonly \(em not supported yet.
+start the array with readonly status.
.SH MANAGE MODE
.HP 12
@@ -2438,12 +2439,11 @@ This will fully activate a partially assembled md array.
.TP
.B \-\-readonly
-This will mark an active array as read-only, providing that it is
-not currently being used.
+This will set an active array as read-only status.
.TP
.B \-\-readwrite
-This will change a
+This will change an
.B readonly
array back to being read/write.
--
2.6.6
* [PATCH 2/2] mdadm/manpage:clustered array doesn't support --array-size yet
From: Zhilong Liu @ 2017-04-10 8:04 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <1491811478-9234-1-git-send-email-zlliu@suse.com>
update man page for --array-size:
A clustered array is not yet allowed to change array_sectors.
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
mdadm.8.in | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mdadm.8.in b/mdadm.8.in
index f006cf5..3555516 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -541,6 +541,10 @@ A value of
restores the apparent size of the array to be whatever the real
amount of available space is.
+The
+.B clustered
+array isn't supported this parameter yet.
+
.TP
.BR \-c ", " \-\-chunk=
Specify chunk size of kilobytes. The default when creating an
--
2.6.6
* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Nix @ 2017-04-10 9:26 UTC (permalink / raw)
To: NeilBrown; +Cc: jes.sorensen, linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <878tn9yymi.fsf@notabene.neil.brown.name>
On 10 Apr 2017, NeilBrown verbalised:
> On Fri, Apr 07 2017, jes.sorensen@gmail.com wrote:
>
>> Next question since I am wearing my 'what is this old stuff doing' hat.
>> mdassemble? Does anything still use this? The reason is a lot of the
>> newer features are explicitly included, and switching to sysfs is
>> effectively going to kill it, unless it gets a major upgrade.
>>
>
> I was never a big fan, of mdassemble, but it is smaller than mdadm and
> some people apparently have (or had) space-constrained boot
> environments.
It also has fewer build-time requirements and can build on systems with
things like old buggy versions of uclibc that can't build a working
mdadm. (Of course, now musl exists, I'm not sure anyone should care
about that. I stopped caring many years ago.)
--
NULL && (void)
* Re: md-cluster Oops 4.9.13
From: Marc Smith @ 2017-04-10 13:25 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: linux-raid
In-Reply-To: <58E45E0C.6030705@gmail.com>
Hi,
Sorry for the delay... I was hoping to cherry-pick this and test
against 4.9.x, but it didn't apply cleanly, although it looks trivial
to do it by hand. Is it recommended/okay to test this patch against
4.9.x? Will the fix eventually be merged into 4.9.x?
--Marc
On Tue, Apr 4, 2017 at 11:01 PM, Guoqing Jiang <jgq516@gmail.com> wrote:
>
>
> On 04/04/2017 10:06 PM, Marc Smith wrote:
>>
>> Hi,
>>
>> I encountered an oops this morning when stopping a MD array
>> (md-cluster)... there were 4 md-cluster array started, and they were
>> in the middle of a rebuild. I stopped the first one and then stopped
>> the second one immediately after and got the oops, here is a
>> transcript of what was on my terminal session:
>>
>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array1
>> mdadm: stopped /dev/md/array1
>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array2
>>
>> Message from syslogd@brimstone-1b at Tue Apr 4 09:54:40 2017 ...
>> brimstone-1b kernel: [649162.174685] BUG: unable to handle kernel NULL
>> pointer dereference at 0000000000000098
>>
>> Using Linux 4.9.13 and here is the output from the kernel messages:
>>
>> --snip--
>> [649158.014731] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: leaving the
>> lockspace group...
>> [649158.015233] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: group event
>> done 0 0
>> [649158.015303] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8:
>> release_lockspace final free
>> [649158.015331] md: unbind<nvme0n1p1>
>> [649158.042540] md: export_rdev(nvme0n1p1)
>> [649158.042546] md: unbind<nvme1n1p1>
>> [649158.048501] md: export_rdev(nvme1n1p1)
>> [649161.759022] md127: detected capacity change from 1000068874240 to 0
>> [649161.759025] md: md127 stopped.
>> [649162.174685] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000098
>> [649162.174727] IP: [<ffffffff81868b40>] recv_daemon+0x1e9/0x373
>
>
> Looks like the recv_daemon is still running after stop array, commit
> 48df498 "md: move bitmap_destroy to the beginning of __md_stop"
> ensure it won't happen.
>
>
> [snip]
>
>> Perhaps this is already fixed in later versions? Let me know if you
>> need any additional information.
>
>
> Could you pls try with the latest version? Please let me know if you
> still see it, thanks.
>
> Regards,
> Guoqing
>
* Re: [RFC PATCH v5] IV Generation algorithms for dm-crypt
From: Milan Broz @ 2017-04-10 14:00 UTC (permalink / raw)
To: Binoy Jayan, Oded, Ofir
Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad
In-Reply-To: <1491562064-23591-1-git-send-email-binoy.jayan@linaro.org>
On 04/07/2017 12:47 PM, Binoy Jayan wrote:
> ===============================================================================
> dm-crypt optimization for larger block sizes
> ===============================================================================
...
> Tests with dd [direct i/o]
>
> Sequential read -0.134 %
> Sequential Write +0.091 %
>
> Tests with fio [Aggregate bandwidth - aggrb]
>
> Random Read +0.358 %
> Random Write +0.010 %
>
> Tests with bonnie++ [768 MiB File, 384 MiB Ram]
> after mounting dm-crypt target as ext4
>
> Sequential o/p [per-char] -2.876 %
> Sequential o/p [per-blk] +0.992 %
> Sequential o/p [re-write] +4.465 %
>
> Sequential i/p [per-char] -0.453 %
> Sequential i/p [per-blk] -0.740 %
>
> Sequential create -0.255 %
> Sequential delete +0.042 %
> Random create -0.007 %
> Random delete +0.454 %
>
> NB: The '+' sign shows improvement and '-' shows degradation.
> The tests were performed with minimal cpu load.
> Tests with higher cpu load to be done
Well, it is good that there is no performance degradation but it
would be nice to have some user of it that proves it is really
working for your hw.
FYI - with a patch that increases the dmcrypt sector size to 4k
I can see a speed improvement, usually of 5-15% with sync AES-NI
(depending on access pattern); with dmcrypt mapped to memory
it is even close to a 20% speed-up (but such a configuration is
completely artificial).
I wonder why an increased dmcrypt sector size does not work for your hw;
it should help as well (and can be combined later with this IV approach).
(For native 4k drives this should be used in future anyway...)
Milan
* Re: [PATCH 2/7] Makefile, x86, LLVM: disable unsupported optimization flags
From: Masahiro Yamada @ 2017-04-10 14:54 UTC (permalink / raw)
To: Michael Davidson
Cc: Matthias Kaehlcke, Michal Marek, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Herbert Xu, David S. Miller, Shaohua Li,
Alexander Potapenko, Dmitry Vyukov, X86 ML,
Linux Kbuild mailing list, Linux Kernel Mailing List,
linux-crypto, linux-raid, Arnd Bergmann
In-Reply-To: <CA+=D-XVbZfgaQ+ZrjctAF5Ek3kaF3XqTLsSu==hK9savDztKsw@mail.gmail.com>
Hi.
2017-04-06 4:11 GMT+09:00 Michael Davidson <md@google.com>:
> It "works" for the cases that I currently care about but I have to say
> that I am uneasy about adding -Werror to the cc-option test in this
> way.
>
> Suppose that one of the *other* flags that is implicitly passed to the
> compiler by cc-option - eg something that was explicitly specified in
> $(KBUILD_CFLAGS) - triggers a warning. In that case all calls to
> cc-option will silently fail because of the -Werror and valid options
> will not be detected correctly.
Theoretically, options explicitly specified in KBUILD_CFLAGS
should always be valid.
Options that may not be supported in some cases
should be wrapped with $(call cc-option ).
> If everyone is OK with that because "it shouldn't normally ever
> happen" then that is fine, but if does result in a subtle change from
> existing behavior (and a trap that I almost immediately fell into
> after applying a similar patch).
There is a rare case where a particular combination fails
(such as the conflict between -pg and -ffunction-sections
as reported in https://patchwork.kernel.org/patch/9624573/).
In such a case, we may end up swapping the order,
but this should not happen very often.
> On Wed, Apr 5, 2017 at 12:01 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
>> Hi Masahiro,
>>
>> El Thu, Apr 06, 2017 at 03:08:26AM +0900 Masahiro Yamada ha dit:
>>
>>> 2017-03-17 9:15 GMT+09:00 Michael Davidson <md@google.com>:
>>> > Unfortunately, while clang generates a warning about these flags
>>> > being unsupported it still exits with a status of 0 so we have
>>> > to explicitly disable them instead of just using a cc-option check.
>>> >
>>> > Signed-off-by: Michael Davidson <md@google.com>
>>>
>>>
>>> Instead, does the following work for you?
>>> https://patchwork.kernel.org/patch/9657285/
>>
>> Thanks for the pointer, I was about to give this change (or rather its
>> ancestor) a rework myself :)
>>
>>> You need to use
>>> $(call cc-option, ...)
>>> for -falign-jumps=1 and -falign-loops=1
>>
>> I can confirm that this works.
>>
>> Thanks
>>
>> Matthias
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards
Masahiro Yamada
* Re: [PATCH v2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Shaohua Li @ 2017-04-10 16:21 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
jes.sorensen
In-Reply-To: <20170329080013.1445439-1-songliubraving@fb.com>
On Wed, Mar 29, 2017 at 01:00:13AM -0700, Song Liu wrote:
> For raid456 with writeback cache, when the journal device fails during
> normal operation, it is still possible to persist all data, as all
> pending data is still in the stripe cache. However, it is necessary to
> handle journal failure gracefully.
>
> During journal failures, this patch makes the following changes to land
> data in cache to raid disks gracefully:
>
> 1. In raid5_remove_disk(), flush all cached stripes;
> 2. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
> to make progress;
> 3. In delay_towrite(), only process data in the cache (skip dev->towrite);
> 4. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
> in loprio_list
> 5. In r5l_do_submit_io(), submit io->split_bio first (see inline comments
> for details).
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/raid5-cache.c | 27 ++++++++++++++++++---------
> drivers/md/raid5.c | 28 ++++++++++++++++++++++++----
> 2 files changed, 42 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index b6194e0..0838617 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -632,20 +632,29 @@ static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
> __r5l_set_io_unit_state(io, IO_UNIT_IO_START);
> spin_unlock_irqrestore(&log->io_list_lock, flags);
>
> + /*
> + * In case of journal device failures, submit_bio will get error
> + * and calls endio, then active stripes will continue write
> + * process. Therefore, it is not necessary to check Faulty bit
> + * of journal device here.
> + *
> + * However, calling r5l_log_endio(current_bio) may change
> + * split_bio. Therefore, it is necessary to check split_bio before
> + * submit current_bio.
> + */
Sorry for the delay. What did you mean by 'calling r5l_log_endio may change
split_bio'? The split_bio is chained into current_bio. The endio of
current_bio (r5l_log_endio) is only called after all chained bios complete. I
don't see why this change is needed.
> + if (io->split_bio) {
> + if (io->has_flush)
> + io->split_bio->bi_opf |= REQ_PREFLUSH;
> + if (io->has_fua)
> + io->split_bio->bi_opf |= REQ_FUA;
> + submit_bio(io->split_bio);
> + }
> +
> if (io->has_flush)
> io->current_bio->bi_opf |= REQ_PREFLUSH;
> if (io->has_fua)
> io->current_bio->bi_opf |= REQ_FUA;
> submit_bio(io->current_bio);
> -
> - if (!io->split_bio)
> - return;
> -
> - if (io->has_flush)
> - io->split_bio->bi_opf |= REQ_PREFLUSH;
> - if (io->has_fua)
> - io->split_bio->bi_opf |= REQ_FUA;
> - submit_bio(io->split_bio);
> }
>
> /* deferred io_unit will be dispatched here */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6036d5e..4d3d1ab 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3054,6 +3054,11 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
> * When LOG_CRITICAL, stripes with injournal == 0 will be sent to
> * no_space_stripes list.
> *
> + * 3. during journal failure
> + * In journal failure, we try to flush all cached data to raid disks
> + * based on data in stripe cache. The array is read-only to upper
> + * layers, so we would skip all pending writes.
> + *
> */
> static inline bool delay_towrite(struct r5conf *conf,
> struct r5dev *dev,
> @@ -3067,6 +3072,9 @@ static inline bool delay_towrite(struct r5conf *conf,
> if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
> s->injournal > 0)
> return true;
> + /* case 3 above */
> + if (s->log_failed && s->injournal)
> + return true;
> return false;
> }
>
> @@ -4689,10 +4697,15 @@ static void handle_stripe(struct stripe_head *sh)
> " to_write=%d failed=%d failed_num=%d,%d\n",
> s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> s.failed_num[0], s.failed_num[1]);
> - /* check if the array has lost more than max_degraded devices and,
> + /*
> + * check if the array has lost more than max_degraded devices and,
> * if so, some requests might need to be failed.
> + *
> + * When journal device failed (log_failed), we will only process
> + * the stripe if there is data need write to raid disks
> */
> - if (s.failed > conf->max_degraded || s.log_failed) {
> + if (s.failed > conf->max_degraded ||
> + (s.log_failed && s.injournal == 0)) {
> sh->check_state = 0;
> sh->reconstruct_state = 0;
> break_stripe_batch_list(sh, 0);
> @@ -5272,7 +5285,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
> struct list_head *handle_list = NULL;
> struct r5worker_group *wg;
> bool second_try = !r5c_is_writeback(conf->log);
> - bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state);
> + bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state) ||
> + r5l_log_disk_error(conf);
>
> again:
> wg = NULL;
> @@ -7526,6 +7540,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
> int number = rdev->raid_disk;
> struct md_rdev **rdevp;
> struct disk_info *p = conf->disks + number;
> + unsigned long flags;
>
> print_raid5_conf(conf);
> if (test_bit(Journal, &rdev->flags) && conf->log) {
> @@ -7535,7 +7550,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
> * neilb: there is no locking about new writes here,
> * so this cannot be safe.
> */
> - if (atomic_read(&conf->active_stripes)) {
> + if (atomic_read(&conf->active_stripes) ||
> + atomic_read(&conf->r5c_cached_full_stripes) ||
> + atomic_read(&conf->r5c_cached_partial_stripes)) {
> + spin_lock_irqsave(&conf->device_lock, flags);
> + r5c_flush_cache(conf, INT_MAX);
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> return -EBUSY;
It's weird that this is called in raid5_remove_disk. Shouldn't it be called on
log disk error, in case the user doesn't remove the log disk? And this is a
policy change. The user might not want to do the flush, as it exposes the write
hole. I think we should at least print a message here to warn the user about
the flush.
Thanks,
Shaohua
* Re: [PATCH] md: update slab_cache before releasing new stripes when stripes resizing
From: Shaohua Li @ 2017-04-10 16:25 UTC (permalink / raw)
To: Dennis Yang; +Cc: linux-raid
In-Reply-To: <1490773573-32692-1-git-send-email-dennisyang@qnap.com>
On Wed, Mar 29, 2017 at 03:46:13PM +0800, Dennis Yang wrote:
> When growing a raid5 device on a machine with little memory, there is a chance
> that mdadm will be killed and the following bug report can be observed. The
> same bug could also be reproduced in linux-4.10.6.
>
> [57600.075774] BUG: unable to handle kernel NULL pointer dereference at (null)
> [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
> [57600.114678] Oops: 0002 [#1] SMP
> [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P O 4.2.8 #1
> [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
> [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
> [57600.204963] RIP: 0010:[<ffffffff81a6aa87>] [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.213057] RSP: 0018:ffff880043073810 EFLAGS: 00010046
> [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
> [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
> [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
> [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
> [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
> [57600.253999] FS: 00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
> [57600.262078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
> [57600.274942] Stack:
> [57600.276949] ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
> [57600.284383] ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
> [57600.291820] ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
> [57600.299254] Call Trace:
> [57600.301698] [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
> [57600.307523] [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
> [57600.313779] [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
> [57600.319344] [<ffffffff81579942>] drop_one_stripe+0x62/0x90
> [57600.324915] [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
> [57600.330563] [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
> [57600.336650] [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
> [57600.342039] [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
> [57600.348210] [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
> [57600.353959] [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
> [57600.360303] [<ffffffff8157a30b>] check_reshape+0x62b/0x770
> [57600.365866] [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
> [57600.371778] [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
> [57600.377604] [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
> [57600.382827] [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
> [57600.388307] [<ffffffff81195238>] block_ioctl+0x38/0x40
> [57600.393525] [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
> [57600.399010] [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
> [57600.404400] [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
> [57600.409447] [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
> [57600.435460] RIP [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.441208] RSP <ffff880043073810>
> [57600.444690] CR2: 0000000000000000
> [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
>
> The problem is that resize_stripes() releases new stripe_heads before assigning the new
> slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
> after resize_stripes() starts releasing new stripes but right before the new slab cache
> is assigned, it is possible that these new stripe_heads will be freed with the old
> slab_cache, which has already been destroyed, and that triggers this bug.
applied, thanks!
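
A toy model of the ordering problem described above, assuming a
shrinker that frees through whatever cache the conf currently points
at. The toy_* types and functions are hypothetical and only simulate
the race deterministically; they are not the raid5.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slab cache: "alive" goes to 0 once it has been destroyed. */
struct toy_cache {
	int alive;
	int freed;
};

struct toy_conf {
	struct toy_cache *slab_cache;
};

/* What a shrinker would do: free through the cache conf points at. */
static int toy_drop_one(struct toy_conf *conf)
{
	if (!conf->slab_cache->alive)
		return -1;	/* stands in for the oops */
	conf->slab_cache->freed++;
	return 0;
}

/*
 * Simulated resize. With publish_first set, conf is switched to the new
 * cache before any new object can be reached by the shrinker (the fixed
 * order); otherwise the shrinker runs against the stale pointer.
 */
static int toy_resize(struct toy_conf *conf, struct toy_cache *new_cache,
		      int publish_first)
{
	int rc;

	assert(conf != NULL && new_cache != NULL);

	conf->slab_cache->alive = 0;	/* old cache destroyed earlier on */
	if (publish_first)
		conf->slab_cache = new_cache;

	rc = toy_drop_one(conf);	/* shrinker fires during the release */

	conf->slab_cache = new_cache;	/* old code only assigned it here */
	return rc;
}
```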
> Signed-off-by: Dennis Yang <dennisyang@qnap.com>
> ---
> drivers/md/raid5.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6661db2c..172edc1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2286,6 +2286,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
> err = -ENOMEM;
>
> mutex_unlock(&conf->cache_size_mutex);
> +
> + conf->slab_cache = sc;
> + conf->active_name = 1-conf->active_name;
> +
> /* Step 4, return new stripes to service */
> while(!list_empty(&newstripes)) {
> nsh = list_entry(newstripes.next, struct stripe_head, lru);
> @@ -2303,8 +2307,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
> }
> /* critical section pass, GFP_NOIO no longer needed */
>
> - conf->slab_cache = sc;
> - conf->active_name = 1-conf->active_name;
> if (!err)
> conf->pool_size = newsize;
> return err;
> --
> 1.9.1
>
* Re: [PATCH] md/raid6: Fix anomily when recovering a single device in RAID6.
From: Shaohua Li @ 2017-04-10 17:34 UTC (permalink / raw)
To: NeilBrown; +Cc: Brad Campbell, Linux-RAID, Dan Williams
In-Reply-To: <87r31adyuj.fsf@notabene.neil.brown.name>
On Mon, Apr 03, 2017 at 12:11:32PM +1000, Neil Brown wrote:
>
> When recoverying a single missing/failed device in a RAID6,
> those stripes where the Q block is on the missing device are
> handled a bit differently. In these cases it is easy to
> check that the P block is correct, so we do. This results
> in the P block be destroy. Consequently the P block needs
> to be read a second time in order to compute Q. This causes
> lots of seeks and hurts performance.
>
> It shouldn't be necessary to re-read P as it can be computed
> from the DATA. But we only compute blocks on missing
> devices, since c337869d9501 ("md: do not compute parity
> unless it is on a failed drive").
>
> So relax the change made in that commit to allow computing
> of the P block in a RAID6 when it is the only missing
> block.
>
> This makes RAID6 recovery run much faster as the disk just
> "before" the recovering device is no longer seeking
> back-and-forth.
>
> Reported-and-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
Applied, thanks, very interesting analysis!
> ---
> drivers/md/raid5.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c523fd69a7bc..aeb2e236a247 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3617,9 +3617,20 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
> BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
> BUG_ON(test_bit(R5_Wantread, &dev->flags));
> BUG_ON(sh->batch_head);
> +
> + /*
> + * In the raid6 case if the only non-uptodate disk is P
> + * then we already trusted P to compute the other failed
> + * drives. It is safe to compute rather than re-read P.
> + * In other cases we only compute blocks from failed
> + * devices, otherwise check/repair might fail to detect
> + * a real inconsistency.
> + */
> +
> if ((s->uptodate == disks - 1) &&
> + ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
> (s->failed && (disk_idx == s->failed_num[0] ||
> - disk_idx == s->failed_num[1]))) {
> + disk_idx == s->failed_num[1])))) {
> /* have disk failed, and we're requested to fetch it;
> * do compute it
> */
> --
> 2.12.0
>
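The observation that P need not be re-read because it can be computed from the data can be illustrated with a small standalone sketch. This is not the md code: `compute_p()` and its parameters are hypothetical names, and the only fact relied on is that P is the bytewise XOR of the data blocks in a stripe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only (not the md implementation): recompute the P parity
 * block of one stripe as the bytewise XOR of its data blocks.
 * compute_p() and its parameters are hypothetical names.
 */
void compute_p(const uint8_t *const *data, size_t ndata,
               size_t block_len, uint8_t *p)
{
    assert(data != NULL && p != NULL);

    for (size_t off = 0; off < block_len; off++) {
        uint8_t acc = 0;

        /* XOR the byte at this offset across every data block. */
        for (size_t d = 0; d < ndata; d++)
            acc ^= data[d][off];
        p[off] = acc;
    }
}
```

Because all data blocks are already up to date in the case the patch targets, a computation like this replaces the second read of P and the seek that goes with it.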
^ permalink raw reply
* Re: [PATCH V2] md/raid10: reset the 'first' at the end of loop
From: Shaohua Li @ 2017-04-10 17:41 UTC (permalink / raw)
To: Guoqing Jiang; +Cc: linux-raid, shli, neilb
In-Reply-To: <1491441138-16155-1-git-send-email-gqjiang@suse.com>
On Thu, Apr 06, 2017 at 09:12:18AM +0800, Guoqing Jiang wrote:
> We need to set "first = 0" at the end of the rdev_for_each
> loop so that we get the array's min_offset_diff correctly;
> otherwise min_offset_diff just means the last rdev's
> offset diff.
>
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
applied, thanks!
> ---
> drivers/md/raid10.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0f13d57..e055ec9 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3769,6 +3769,7 @@ static int raid10_run(struct mddev *mddev)
>
> if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
> discard_supported = true;
> + first = 0;
> }
>
> if (mddev->queue) {
> @@ -4172,6 +4173,7 @@ static int raid10_start_reshape(struct mddev *mddev)
> if (first || diff < min_offset_diff)
> min_offset_diff = diff;
> }
> + first = 0;
> }
>
> if (max(before_length, after_length) > min_offset_diff)
> --
> 2.6.6
>
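The effect of the missing reset can be shown with a minimal standalone sketch. It is an assumption-laden simplification: `min_offset_diff_of()` is a hypothetical helper, and the per-device differences are passed in as a plain array rather than derived from rdev fields.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: find the smallest offset difference across all
 * devices, using a 'first' flag in the same way as the patched loops.
 * min_offset_diff_of() is a hypothetical helper, not a kernel function.
 */
long min_offset_diff_of(const long *diffs, size_t n)
{
    long min_offset_diff = 0;
    int first = 1;

    assert(diffs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        long diff = diffs[i];

        if (first || diff < min_offset_diff)
            min_offset_diff = diff;
        /*
         * Without this reset 'first' stays true, so every iteration
         * overwrites the result and the last device always wins.
         */
        first = 0;
    }
    return min_offset_diff;
}
```

With inputs 8, 2, 5 the function returns 2; if the `first = 0` line were removed it would return 5, the last value seen.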
^ permalink raw reply
* Re: [PATCH] md:MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop
From: Shaohua Li @ 2017-04-10 17:47 UTC (permalink / raw)
To: Zhilong Liu; +Cc: shli, linux-raid, NeilBrown
In-Reply-To: <1491448593-12938-1-git-send-email-zlliu@suse.com>
On Thu, Apr 06, 2017 at 11:16:33AM +0800, Zhilong Liu wrote:
> From: NeilBrown <neilb@suse.com>
>
> If md_set_readonly() is called and the MD_CLOSING bit is set, the mddev
> cannot be opened any more because the MD_CLOSING bit was never cleared.
> Thus it needs to be cleared in md_ioctl after any call to md_set_readonly()
> or do_md_stop().
> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
Thanks, applied! This one looks like stable material too.
> ---
>
> drivers/md/md.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..906a4bf 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6776,6 +6776,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> void __user *argp = (void __user *)arg;
> struct mddev *mddev = NULL;
> int ro;
> + bool did_set_md_closing = false;
>
> if (!md_ioctl_valid(cmd))
> return -ENOTTY;
> @@ -6865,7 +6866,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> err = -EBUSY;
> goto out;
> }
> + WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
> set_bit(MD_CLOSING, &mddev->flags);
> + did_set_md_closing = true;
> mutex_unlock(&mddev->open_mutex);
> sync_blockdev(bdev);
> }
> @@ -7058,6 +7061,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
> mddev->hold_active = 0;
> mddev_unlock(mddev);
> out:
> + if(did_set_md_closing)
> + clear_bit(MD_CLOSING, &mddev->flags);
> return err;
> }
> #ifdef CONFIG_COMPAT
> --
> 2.6.6
>
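The pattern the patch uses (remember whether this call set the flag, then clear it on the common exit path) can be sketched outside the kernel as follows. Everything here is a stand-in: `DEMO_CLOSING`, `struct demo_dev` and `demo_stop_ioctl()` are hypothetical names, and plain bit operations replace set_bit()/clear_bit().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CLOSING (1u << 0)  /* stands in for MD_CLOSING */

struct demo_dev {
    unsigned int flags;
    int openers;
};

/*
 * Illustrative only: a stop-style request sets the closing flag, and the
 * shared exit path clears it again so that later opens are not refused.
 * Returns 0 on success, -1 if the device is busy.
 */
int demo_stop_ioctl(struct demo_dev *dev)
{
    bool did_set_closing = false;
    int err = 0;

    assert(dev != NULL);

    if (dev->openers > 1) {
        err = -1;               /* busy: the flag was never set */
        goto out;
    }
    dev->flags |= DEMO_CLOSING;
    did_set_closing = true;

    /* ... the stop or set-readonly work would happen here ... */
out:
    /* Only undo what this call did. */
    if (did_set_closing)
        dev->flags &= ~DEMO_CLOSING;
    return err;
}
```

Tracking the flag in a local variable means the busy path, which jumps to `out` before the flag is set, does not clear a bit it never owned.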
^ permalink raw reply
* Re: [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Shaohua Li @ 2017-04-10 17:49 UTC (permalink / raw)
To: Zhilong Liu; +Cc: shli, linux-raid
In-Reply-To: <1491804955-7548-1-git-send-email-zlliu@suse.com>
On Mon, Apr 10, 2017 at 02:15:55PM +0800, Zhilong Liu wrote:
> md.c: the mddev lock needs to be released before
> array_size_store() returns.
applied, thanks!
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
> drivers/md/md.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..5327236 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
> return err;
>
> /* cluster raid doesn't support change array_sectors */
> - if (mddev_is_clustered(mddev))
> + if (mddev_is_clustered(mddev)) {
> + mddev_unlock(mddev);
> return -EINVAL;
> + }
>
> if (strncmp(buf, "default", 7) == 0) {
> if (mddev->pers)
> --
> 2.6.6
>
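The bug class fixed here, an early return that skips the unlock, can be reproduced in a few lines. This is a sketch with a toy lock, not the mddev lock: `struct demo_lock`, `demo_lock_take()`, `demo_lock_release()` and `demo_size_store()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_lock {
    bool held;
};

void demo_lock_take(struct demo_lock *l)
{
    assert(!l->held);   /* taking it twice would be a deadlock */
    l->held = true;
}

void demo_lock_release(struct demo_lock *l)
{
    assert(l->held);
    l->held = false;
}

/*
 * Illustrative only: every return after the lock is taken must release
 * it, including the early "not supported" return.
 * Returns 0 on success, -1 when the operation is not supported.
 */
int demo_size_store(struct demo_lock *l, bool clustered)
{
    demo_lock_take(l);

    if (clustered) {
        demo_lock_release(l);   /* the release the patch adds */
        return -1;
    }

    /* ... the size update would happen here ... */
    demo_lock_release(l);
    return 0;
}
```

If the release in the `clustered` branch were omitted, the next call would trip the assertion in `demo_lock_take()`, which is the toy equivalent of the lock staying held after the -EINVAL return.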
^ permalink raw reply
* Re: [PATCH] md/raid1: avoid reusing a resync bio after error handling.
From: Shaohua Li @ 2017-04-10 18:04 UTC (permalink / raw)
To: NeilBrown; +Cc: Linux-RAID, Michael Wang
In-Reply-To: <87bmsaguhe.fsf@notabene.neil.brown.name>
On Thu, Apr 06, 2017 at 12:06:37PM +1000, Neil Brown wrote:
>
> fix_sync_read_error() modifies a bio on a newly faulty
> device by setting bi_end_io to end_sync_write.
> This ensures that put_buf() will still call rdev_dec_pending()
> as required, but makes sure that subsequent code in
> fix_sync_read_error() doesn't try to read from the device.
>
> Unfortunately this interacts badly with sync_request_write()
> which assumes that any bio with bi_end_io set to non-NULL
> other than end_sync_read is safe to write to.
>
> As the device is now faulty it doesn't make sense to write.
> As the bio was recently used for a read, it is "dirty"
> and not suitable for immediate submission.
> In particular, ->bi_next might be non-NULL, which will cause
> generic_make_request() to complain.
>
> Break this interaction by refusing to write to devices
> which are marked as Faulty.
>
> Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Cc: stable@vger.kernel.org (v4.10+)
> Signed-off-by: NeilBrown <neilb@suse.com>
Thanks, applied!
> ---
> drivers/md/raid1.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a70283753a35..9c1b2231d2db 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2185,6 +2185,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
> (i == r1_bio->read_disk ||
> !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
> continue;
> + if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
> + continue;
>
> bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
> if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
> --
> 2.12.2
>
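The added check amounts to one more filter in the loop that decides which mirrors get a write. A minimal standalone sketch, with hypothetical names (`struct demo_mirror`, `demo_count_writes()`) and booleans standing in for the bi_end_io and Faulty tests:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_mirror {
    bool has_write_bio; /* bio looks like it is set up for writing */
    bool faulty;        /* device has been marked Faulty */
};

/*
 * Illustrative only: count the mirrors that would be written to.
 * A bio on a newly faulty device can look writable, so the faulty
 * state has to be checked separately.
 */
size_t demo_count_writes(const struct demo_mirror *m, size_t n)
{
    size_t writes = 0;

    assert(m != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!m[i].has_write_bio)
            continue;
        if (m[i].faulty)
            continue;   /* the check the patch adds */
        writes++;
    }
    return writes;
}
```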
^ permalink raw reply
* Re: [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Shaohua Li @ 2017-04-10 19:09 UTC (permalink / raw)
To: Artur Paszkiewicz; +Cc: linux-raid, songliubraving
In-Reply-To: <20170404111358.14829-2-artur.paszkiewicz@intel.com>
On Tue, Apr 04, 2017 at 01:13:55PM +0200, Artur Paszkiewicz wrote:
Cc: Song, the raid5-cache needs a similar fix.
> Allocate both struct ppl_io_unit and its header_page from a shared
> mempool to avoid a possible deadlock. Implement allocate and free
> functions for the mempool, remove the second pool for allocating
> header_page. The header_pages are now freed with their io_units, not
> when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
> for allocating ppl_io_unit because we can handle failed allocations and
> there is no reason to utilize emergency reserves.
I applied the last 3 patches, but I have some nitpicks for this one.
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
> drivers/md/raid5-ppl.c | 53 ++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 36 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 86ea9addb51a..42e43467d1e8 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -102,7 +102,6 @@ struct ppl_conf {
> struct kmem_cache *io_kc;
> mempool_t *io_pool;
> struct bio_set *bs;
> - mempool_t *meta_pool;
>
> /* used only for recovery */
> int recovered_entries;
> @@ -195,6 +194,33 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
> return tx;
> }
>
> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
> +{
> + struct kmem_cache *kc = pool_data;
> + struct ppl_io_unit *io;
> +
> + io = kmem_cache_alloc(kc, gfp_mask);
> + if (!io)
> + return NULL;
> +
> + io->header_page = alloc_page(gfp_mask);
> + if (!io->header_page) {
> + kmem_cache_free(kc, io);
> + return NULL;
> + }
> +
> + return io;
Maybe directly use GFP_NOWAIT here; we don't use any other gfp flags.
> +}
> +
> +static void ppl_io_pool_free(void *element, void *pool_data)
> +{
> + struct kmem_cache *kc = pool_data;
> + struct ppl_io_unit *io = element;
> +
> + __free_page(io->header_page);
> + kmem_cache_free(kc, io);
> +}
> +
> static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> struct stripe_head *sh)
> {
> @@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> struct ppl_io_unit *io;
> struct ppl_header *pplhdr;
>
> - io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> + io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
> if (!io)
> return NULL;
>
> - memset(io, 0, sizeof(*io));
> io->log = log;
> + io->entries_count = 0;
> + io->pp_size = 0;
> + io->submitted = false;
I'd suggest moving the memset to ppl_io_pool_alloc. I don't think we need to
optimize to avoid setting header_page. And doing memset is less error prone,
for example when adding new fields.
Otherwise looks quite good.
Thanks,
Shaohua
> INIT_LIST_HEAD(&io->log_sibling);
> INIT_LIST_HEAD(&io->stripe_list);
> atomic_set(&io->pending_stripes, 0);
> bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
>
> - io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> pplhdr = page_address(io->header_page);
> clear_page(pplhdr);
> memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> @@ -369,8 +396,6 @@ static void ppl_log_endio(struct bio *bio)
> if (bio->bi_error)
> md_error(ppl_conf->mddev, log->rdev);
>
> - mempool_free(io->header_page, ppl_conf->meta_pool);
> -
> list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
> list_del_init(&sh->log_list);
>
> @@ -998,7 +1023,6 @@ static void __ppl_exit_log(struct ppl_conf *ppl_conf)
>
> kfree(ppl_conf->child_logs);
>
> - mempool_destroy(ppl_conf->meta_pool);
> if (ppl_conf->bs)
> bioset_free(ppl_conf->bs);
> mempool_destroy(ppl_conf->io_pool);
> @@ -1104,25 +1128,20 @@ int ppl_init_log(struct r5conf *conf)
>
> ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
> if (!ppl_conf->io_kc) {
> - ret = -EINVAL;
> + ret = -ENOMEM;
> goto err;
> }
>
> - ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
> + ppl_conf->io_pool = mempool_create(conf->raid_disks, ppl_io_pool_alloc,
> + ppl_io_pool_free, ppl_conf->io_kc);
> if (!ppl_conf->io_pool) {
> - ret = -EINVAL;
> + ret = -ENOMEM;
> goto err;
> }
>
> ppl_conf->bs = bioset_create(conf->raid_disks, 0);
> if (!ppl_conf->bs) {
> - ret = -EINVAL;
> - goto err;
> - }
> -
> - ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
> - if (!ppl_conf->meta_pool) {
> - ret = -EINVAL;
> + ret = -ENOMEM;
> goto err;
> }
>
> --
> 2.11.0
>
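The idea of allocating the io unit and its header page as one pool element, and freeing them together, can be sketched in userspace. This is not the mempool API: `struct demo_io_unit`, `demo_io_alloc()` and `demo_io_free()` are hypothetical names, and malloc()/free() stand in for kmem_cache_alloc() and alloc_page().

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

struct demo_io_unit {
    void *header_page;
};

/*
 * Illustrative only: allocate the unit and its header page as a pair.
 * If the second allocation fails, the first is rolled back so that the
 * caller sees either a complete element or NULL.
 */
void *demo_io_alloc(void)
{
    struct demo_io_unit *io = malloc(sizeof(*io));

    if (!io)
        return NULL;

    io->header_page = malloc(DEMO_PAGE_SIZE);
    if (!io->header_page) {
        free(io);
        return NULL;
    }
    return io;
}

/* Free both halves of the element, page first. */
void demo_io_free(void *element)
{
    struct demo_io_unit *io = element;

    assert(io != NULL);
    free(io->header_page);
    free(io);
}
```

Pairing the two allocations in one alloc callback means a caller never holds an io unit while waiting on a second pool for its page, which is the dependency the patch description identifies as a possible deadlock.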
^ permalink raw reply
* Re: [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
From: Anthony Youngman @ 2017-04-10 19:40 UTC (permalink / raw)
To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
philipp.reisner, lars.ellenberg
Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170405142205.6477-4-hch@lst.de>
s/past/paste/
On 05/04/17 15:21, Christoph Hellwig wrote:
> Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
> that limit the write zeroes size.
Cheers,
Wol
^ permalink raw reply
* Re: 4.10 + 765d704db: no improvemtn in write rates with md/raid5 group_thread_cnt > 0
From: Shaohua Li @ 2017-04-10 20:10 UTC (permalink / raw)
To: Nix; +Cc: linux-raid, Shaohua Li
In-Reply-To: <87zifvj61v.fsf@esperi.org.uk>
On Wed, Apr 05, 2017 at 03:13:48PM +0100, Nix wrote:
> So you'd expect write rates on a RAID-5 array to be higher than write rates on a
> single spinning-rust disk, right? Because, even with Shaohua's commit
> 765d704db1f583630d52 applied atop 4.10, I see little sign of it. Does this
> commit depend upon something else to stop death by seeking with
> group_thread_cnt > 0? It didn't look like it to me...
>
> The results Shaohua showed in the original commit were very impressive, but for
> the life of me I can't figure out how to get anything like them.
That only works well with a large iodepth. For a single writer, we are still
far from the theoretical bandwidth. I actually wrote in the commit log:
"We are pretty close to the maximum bandwidth in the large iodepth
case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline."
Thanks,
Shaohua
>
> With group_thread_cnt 0, I max out at a bit higher than the 240MiB/s one disk in
> this array can manage on its own, for obvious reasons: md_raid5 CPU saturation.
> (This is with a 512KiB chunksize, stripe_cache_size of 512: yes, I know that's
> small, it's just a random slice taken out of a much larger test series: the
> array is a smallish non-degraded unjournalled four-element md5 initialized with
> --assume-clean for benchmarking). Similar results are seen with ext4 and xfs.
> Trimmed-down iozone -a output, so only one serial writer, but still:
>
> kB reclen write rewrite read reread stride-read
> 64 4 6752 15647 26489 30145 26678
> 64 8 6639 25236 45101 56289 43158
> 64 16 6014 9799 67364 89009 60900
> 64 32 35200 48781 7374 177207 7336
> 64 64 32420 70551 109395 229470 97868
> [...]
> 32768 64 28181 30576 265403 178438 299889
> 32768 128 41659 39989 319709 320689 330949
> 32768 256 45402 44555 320689 357564 451256
> 32768 512 42559 40556 177862 299744 466529
> 32768 1024 68005 52814 415747 391507 706177
> 32768 2048 91701 103918 520689 540128 1061339
> 32768 4096 177716 169486 487277 514111 683463
> 32768 8192 218923 233152 539853 616869 453021
> 32768 16384 199068 198872 569353 619913 535240
> [...]
> 262144 64 25148 32423 385802 378681 27762
> 262144 128 42510 41626 436994 380669 48004
> 262144 256 43415 44004 436209 418971 76697
> 262144 512 41408 40399 342862 401145 116781
> 262144 1024 68870 59341 465737 507454 265154
> 262144 2048 101994 91693 589277 582836 296474
> 262144 4096 176852 166200 581922 649215 421253
> 262144 8192 226696 221838 601174 633347 569766
> 262144 16384 307843 297985 644679 659060 569302
> 524288 64 25155 24527 392401 401908 21461
> 524288 128 41422 41525 433156 464331 35360
> 524288 256 42059 43742 443281 415799 70171
> 524288 512 41253 39360 414306 428993 75387
> 524288 1024 66081 61151 498880 517952 186959
> 524288 2048 101272 90418 610467 623258 274331
> 524288 4096 171489 173381 601689 576333 314290
> 524288 8192 220943 215226 641713 607459 444827
> 524288 16384 289055 296340 651010 671623 503633
>
> Read rates are as high as I'd expect for a four-disk RAID-5 array, and the
> sequential write output rates, while higher than one disk can manage, are
> thresholded here by the performance of the md I/O thread, as expected.
>
> If I boost group_thread_cnt to, say, 2, I see:
>
> 64 4 3677 14565 27936 36056 29629
> 64 8 6670 21608 53422 69187 32045
> 64 16 6682 26209 70329 103891 66662
> 64 32 28624 40048 7312 154556 7345
> 64 64 38327 43213 89127 260160 90540
> [...]
> 32768 64 14328 18580 265136 282946 308082
> 32768 128 26310 24803 265762 323414 354685
> 32768 256 29115 27073 238659 308974 345723
> 32768 512 21572 21345 293312 314086 345365
> 32768 1024 43978 38071 395715 345161 545821
> 32768 2048 82898 70840 293151 470398 922082
> 32768 4096 143350 124658 391980 659819 617984
> 32768 8192 164297 227661 570423 645141 515009
> 32768 16384 157701 171804 568484 451448 350715
> [...]
> 262144 64 17150 17693 391561 382751 28374
> 262144 128 25385 26498 423685 410359 47148
> 262144 256 29219 30244 392992 421748 80403
> 262144 512 24303 24686 399453 371882 122861
> 262144 1024 42296 42535 403020 508195 261339
> 262144 2048 75740 63125 606979 589329 296124
> 262144 4096 134646 137543 562749 590893 392938
> 262144 8192 237800 239847 631752 620766 475791
> 262144 16384 267889 304517 635674 612164 598521
> 524288 64 17691 17776 403333 374628 21673
> 524288 128 25575 25609 396568 439018 34526
> 524288 256 29984 29990 412587 437099 71650
> 524288 512 24971 25599 403074 431581 75471
> 524288 1024 42545 43657 505740 519112 200811
> 524288 2048 72519 75604 559987 589069 257654
> 524288 4096 135122 140745 622450 499336 331273
> 524288 8192 232848 231307 592729 604849 432296
> 524288 16384 280105 271252 647725 664868 472363
>
> Larger writes are clearly still thresholded.
>
> Boost the thread count more, here, to 8:
>
> 64 4 7834 14388 30346 40300 6124
> 64 8 17236 21282 6984 37644 6842
> 64 16 21100 24720 7208 120277 7199
> 64 32 29411 45553 7374 162411 7357
> 64 64 3671 59588 78128 256923 82804
> [...]
> 32768 64 14261 17866 261303 289135 294245
> 32768 128 25832 27639 298172 324766 342822
> 32768 256 26477 27196 277318 339353 352967
> 32768 512 17848 19875 339424 272225 387746
> 32768 1024 36017 38945 482068 464194 110825
> 32768 2048 64240 67976 551762 505772 76629
> 32768 4096 71022 117680 578561 696507 752493
> 32768 8192 161080 207790 564343 556796 546488
> 32768 16384 172937 233103 521368 603562 418679
> [...]
> 262144 64 17170 17452 352337 351258 27824
> 262144 128 25318 25522 418977 424859 47112
> 262144 256 26405 27092 426170 419684 79047
> 262144 512 20185 20271 398733 411974 135554
> 262144 1024 39013 38238 497919 438150 180384
> 262144 2048 71054 70921 586634 535676 258955
> 262144 4096 113222 121554 616548 604177 293088
> 262144 8192 184086 187845 551395 586126 496147
> 262144 16384 286319 272419 645900 659103 589384
> 524288 64 16980 16756 385746 381476 21462
> 524288 128 24993 25482 428855 438250 34889
> 524288 256 26517 26134 448088 395352 70225
> 524288 512 19534 19484 418764 416630 76975
> 524288 1024 37645 38370 514030 511638 177818
> 524288 2048 68469 72200 602688 542627 251162
> 524288 4096 115467 121220 598738 629120 289589
> 524288 8192 185093 182044 621233 586919 437162
> 524288 16384 250990 266257 620428 660663 494770
>
> Still thresholded. Yes, this is only one serial writer, but nonetheless this
> seems a bit odd.
>
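For scale, the "theory value" for sequential writes can be estimated with a rough sketch. It assumes idealised full-stripe writes in which every data disk streams at the single-disk rate; that model and the function name `raid5_ideal_write_mib()` are assumptions made for illustration, not something measured in this thread. The 240 MiB/s input is the single-disk figure quoted above.

```c
#include <assert.h>

/*
 * Illustrative only: idealised sequential write bandwidth of an n-disk
 * RAID-5, assuming full-stripe writes. One disk's worth of each stripe
 * holds parity, so n - 1 disks carry data.
 */
unsigned int raid5_ideal_write_mib(unsigned int ndisks,
                                   unsigned int per_disk_mib)
{
    assert(ndisks >= 2);
    return (ndisks - 1) * per_disk_mib;
}
```

For the four-disk array discussed here, `raid5_ideal_write_mib(4, 240)` gives 720 MiB/s as the ceiling under those assumptions.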
> --
> NULL && (void)
^ permalink raw reply
* Shrinking an array
From: Wakko Warner @ 2017-04-11 0:30 UTC (permalink / raw)
To: linux-raid
I have a question about shrinking an array. My current array is 4x 2tb
disks in raid6 (md0). The array was created on the 2nd partition of each
disk and spans most of the disk. I would like to replace the 2tb disks with
750gb disks. md0 is a luks container with lvm underneath. I have less than
1tb actually in use. What would the recommended procedure be for shrinking
this? I've watched this list, but I don't think I've come across anyone
actually wanting to do this before.
I'm thinking of these steps already:
1) Shrink PV.
2) Shrink luks. I'm aware that there is no size metadata, but the dm
mapping would need to be shrunk.
3) Shrink md0. I did this once when I changed a 6 drive raid6 into a 5
drive raid6. Would I use --array-size= or --size= ? I understand the
difference is the size of md0 vs the individual members.
So for number 4, if md0 is now small enough, will it accept a member that is
smaller? If so, I should be able to add the member to the array and issue
--replace.
Thanks.
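A quick capacity check for the plan above, as a rough sketch. It assumes the usual RAID6 rule that two members' worth of space holds P and Q, and it ignores partition layout, md data offset, the LUKS header and LVM metadata, so the real usable size will be somewhat smaller. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: usable capacity of an n-member RAID6, in the same
 * unit as member_gb. Two members' worth of space holds P and Q.
 */
uint64_t raid6_usable_gb(unsigned int ndisks, uint64_t member_gb)
{
    assert(ndisks >= 4);
    return (uint64_t)(ndisks - 2) * member_gb;
}

/* True if the data in use would still fit after moving to smaller members. */
bool fits_after_shrink(unsigned int ndisks, uint64_t member_gb,
                       uint64_t used_gb)
{
    return used_gb <= raid6_usable_gb(ndisks, member_gb);
}
```

With four 750 GB members the estimate is 1500 GB, so data of under 1 TB fits with room to spare under these assumptions.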
--
Microsoft has beaten Volkswagen's world record. Volkswagen only created 22
million bugs.
^ permalink raw reply