Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: Level change from 4 disk RAID5 to 4 disk RAID6
From: NeilBrown @ 2017-04-10  1:04 UTC (permalink / raw)
  To: LM, linux-raid
In-Reply-To: <20170408214239.GF10267@lars-laptop>

[-- Attachment #1: Type: text/plain, Size: 1632 bytes --]

On Sat, Apr 08 2017, LM wrote:

> Hi,
>
> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
> replace the 4 disks by 4 disks with a size of 2TB each.
>
> As far as I understand the man page, this can be achieved by replacing
> the devices one after another and for each device rebuild the degraded
> array with:
>
>    mdadm /dev/md0 --add /dev/sdX1
>
> Then the level change can be done together with growing the array:
>
>    mdadm --grow /dev/md0 --level=raid6 --backup-file=/root/backup-md0
>
> Does this work?
>
> I am asking if it works, because the man page also says:
>
>> mdadm --grow /dev/md4 --level=6 --backup-file=/root/backup-md4
>>        The array /dev/md4 which is currently a RAID5 array will
>>        be converted to RAID6.  There should normally already be
>>        a spare drive attached to the array as a RAID6 needs one
>>        more drive than a matching RAID5.
>
> And in my case only the size of disks is increased, not their number.
>

Yes, it probably works, and you probably don't need a backup file.
Though you might need to explicitly tell mdadm to keep the number of
devices unchanged by specifying "--raid-disk=4".

You probably aren't very encouraged that I say "probably" and "might",
and this is deliberate.

I recommend that you crate 4 10Meg files, use losetup to create 10M
devices, and build a RAID5 over them with --size=5M.
Then try the --grow --level=6 command, and see what happens.
If you mess up, you can easily start from scratch and try again.
If it works, you can have some confidence that the same process will
have the same result on real devices.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* (unknown), 
From: hp @ 2017-04-10  3:30 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: 7718637436266_linux-raid.zip --]
[-- Type: application/zip, Size: 3603 bytes --]

^ permalink raw reply

* [PATCH] mdadm.c:fix compile warning "mdfd is uninitialized"
From: Zhilong Liu @ 2017-04-10  4:49 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <016b495a-1361-0f29-0fcb-6af008625565@gmail.com>

Initialized the mdfd as -1 to prevent compile error
of some compilers.
For example, gcc version 4.8.5(SUSE Linux).

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 mdadm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mdadm.c b/mdadm.c
index 001ff68..41dae1d 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1916,7 +1916,7 @@ static int misc_list(struct mddev_dev *devlist,
 	int rv = 0;
 
 	for (dv = devlist; dv; dv = (rv & 16) ? NULL : dv->next) {
-		int mdfd;
+		int mdfd = -1;
 
 		switch(dv->disposition) {
 		case 'D':
-- 
2.6.6


^ permalink raw reply related

* Re: Level change from 4 disk RAID5 to 4 disk RAID6
From: Wols Lists @ 2017-04-10  5:41 UTC (permalink / raw)
  To: LM, linux-raid
In-Reply-To: <20170408214239.GF10267@lars-laptop>

On 08/04/17 22:42, LM wrote:
> Hi,
> 
> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
> replace the 4 disks by 4 disks with a size of 2TB each.
> 
> As far as I understand the man page, this can be achieved by replacing
> the devices one after another and for each device rebuild the degraded
> array with:
> 
>    mdadm /dev/md0 --add /dev/sdX1

Do you have a spare SATA port or whatever your drives are. If so, then
use the --replace option to mdadm, don't fail then add. You're risking a
drive failure taking out your array - not a good move.

And if you don't have a spare port, $20 for a PCI card or whatever is a
good investment to keep your data safe.

Have a look at the raid wiki - it tries to be a bit more verbose and
easily comprehensible than the man page.

Cheers,
Wol

^ permalink raw reply

* [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Zhilong Liu @ 2017-04-10  6:15 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Zhilong Liu

md.c: it needs to release the mddev lock before
the array_size_store() returns.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 drivers/md/md.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f6ae1d6..5327236 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
 		return err;
 
 	/* cluster raid doesn't support change array_sectors */
-	if (mddev_is_clustered(mddev))
+	if (mddev_is_clustered(mddev)) {
+		mddev_unlock(mddev);
 		return -EINVAL;
+	}
 
 	if (strncmp(buf, "default", 7) == 0) {
 		if (mddev->pers)
-- 
2.6.6


^ permalink raw reply related

* Re: [v2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome
From: Michael Ellerman @ 2017-04-10  6:54 UTC (permalink / raw)
  To: Daniel Axtens, Matt Brown, linuxppc-dev; +Cc: linux-raid
In-Reply-To: <87r316hxmi.fsf@possimpible.ozlabs.ibm.com>

Daniel Axtens <dja@axtens.net> writes:

> Hi Matt,
>
> Thanks for answering my questions and doing those fixes.
>
>
>> Bugs fixed:
>> 	- A small bug in pq.h regarding a missing and mismatched
>> 	  ifdef statement
>> 	- Fixed test/Makefile to correctly build test on ppc
>>
>
> I think this commit should be labelled:
> Fixes: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")
>
> mpe can probably add that when he merges - no need to do a new version :)

Please send a separate patch which does that fix.

>>  else
>> -        HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
>> -                         gcc -c -x c - >&/dev/null && \
>> -                         rm ./-.o && echo yes)
>> -        ifeq ($(HAS_ALTIVEC),yes)
>> -                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
>> +	 HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
>> +			 gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
>> +	 ifeq ($(HAS_ALTIVEC),yes)
>> +		CFLAGS += -I../../../arch/powerpc/include
>> +		CFLAGS += -DCONFIG_ALTIVEC
>> +		OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
>> +			vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>>          endif
>>  endif
> Looks like vim has replaced spaces with tabs here. Not sure how much we
> care...

We care at least because it makes the diff look bigger than it really
is, if I'm reading it right the first three lines haven't actually
changed.

>> @@ -97,6 +99,18 @@ altivec4.c: altivec.uc ../unroll.awk
>>  altivec8.c: altivec.uc ../unroll.awk
>>  	$(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
>>
> ... especially seeing as tabs are already used in the file here!

It's a Makefile! Tabs have meaning :)

>> +# ifdef __KERNEL__
>> +	return (cpu_has_feature(CONFIG_ALTIVEC) &&
>> +		cpu_has_feature(CPU_FTR_ARCH_207S));
> I think CPU_FTR_ARCH_207S implies Altivec? Again, not a real problem,

It doesn't.

And also CONFIG_ALTIVEC is not a cpu feature!

You should be using CPU_FTR_ALTIVEC_COMP. That copes with the case where
the kernel is compiled without ALTIVEC support.

cheers

^ permalink raw reply

* Re: [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Guoqing Jiang @ 2017-04-10  7:23 UTC (permalink / raw)
  To: Zhilong Liu, shli; +Cc: linux-raid
In-Reply-To: <1491804955-7548-1-git-send-email-zlliu@suse.com>


On 04/10/2017 02:15 PM, Zhilong Liu wrote:
> md.c: it needs to release the mddev lock before
> the array_size_store() returns.

Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size 
are not supported")

>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>   drivers/md/md.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..5327236 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
>   		return err;
>   
>   	/* cluster raid doesn't support change array_sectors */
> -	if (mddev_is_clustered(mddev))
> +	if (mddev_is_clustered(mddev)) {
> +		mddev_unlock(mddev);
>   		return -EINVAL;
> +	}

Reviewed-by: Guoqing Jiang <gqjiang@suse.com>

Thanks,
Guoqing

^ permalink raw reply

* [PATCH 0/2] mdadm/manpage: update manpage for readonly and array-size
From: Zhilong Liu @ 2017-04-10  8:01 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

Hi, Jes;
  These two patches is to update the description for readonly and array-size
sectors.

Thanks,
Zhilong

---

Zhilong Liu (2):
  mdadm/manpage:update description for readonly in manpage
  mdadm/manpage:clustered array doesn't support --array-size yet

 mdadm.8.in | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

-- 
2.6.6


^ permalink raw reply

* [PATCH 1/2] mdadm/manpage:update description for readonly in manpage
From: Zhilong Liu @ 2017-04-10  8:04 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <1491811296-9118-1-git-send-email-zlliu@suse.com>

update readonly/readwrite in man page:
Currently both the readwrite and readonly are worked well,
thus updates description for them.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 mdadm.8.in | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mdadm.8.in b/mdadm.8.in
index 744c12b..f006cf5 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -925,7 +925,8 @@ will not try to be so clever.
 Start the array
 .B read only
 rather than read-write as normal.  No writes will be allowed to the
-array, and no resync, recovery, or reshape will be started.
+array, and no resync, recovery, or reshape will be started. It works with
+Create, Assemble, Manage and Misc mode.
 
 .TP
 .BR \-a ", " "\-\-auto{=yes,md,mdp,part,p}{NN}"
@@ -2232,7 +2233,7 @@ be in use.
 
 .TP
 .B \-\-readonly
-start the array readonly \(em not supported yet.
+start the array with readonly status.
 
 .SH MANAGE MODE
 .HP 12
@@ -2438,12 +2439,11 @@ This will fully activate a partially assembled md array.
 
 .TP
 .B \-\-readonly
-This will mark an active array as read-only, providing that it is
-not currently being used.
+This will set an active array as read-only status.
 
 .TP
 .B \-\-readwrite
-This will change a
+This will change an
 .B readonly
 array back to being read/write.
 
-- 
2.6.6


^ permalink raw reply related

* [PATCH 2/2] mdadm/manpage:clustered array doesn't support --array-size yet
From: Zhilong Liu @ 2017-04-10  8:04 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <1491811478-9234-1-git-send-email-zlliu@suse.com>

update man page for --array-size:
clustered array isn't allowed to change array_sector by now.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 mdadm.8.in | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mdadm.8.in b/mdadm.8.in
index f006cf5..3555516 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -541,6 +541,10 @@ A value of
 restores the apparent size of the array to be whatever the real
 amount of available space is.
 
+The
+.B clustered
+array isn't supported this parameter yet.
+
 .TP
 .BR \-c ", " \-\-chunk=
 Specify chunk size of kilobytes.  The default when creating an
-- 
2.6.6


^ permalink raw reply related

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Nix @ 2017-04-10  9:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: jes.sorensen, linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <878tn9yymi.fsf@notabene.neil.brown.name>

On 10 Apr 2017, NeilBrown verbalised:

> On Fri, Apr 07 2017, jes.sorensen@gmail.com wrote:
>
>> Next question since I am wearing my 'what is this old stuff doing' hat.
>> mdassemble? Does anything still use this? The reason is a lot of the
>> newer features are explicitly included, and switching to sysfs is
>> effectively going to kill it, unless it gets a major upgrade.
>>
>
> I was never a big fan, of mdassemble, but it is smaller than mdadm and
> some people apparently have (or had) space-constrained boot
> environments.

It also has fewer build-time requirements and can build on systems with
things like old buggy versions of uclibc that can't build a working
mdadm. (Of course, now musl exists, I'm not sure anyone should care
about that. I stopped caring many years ago.)

-- 
NULL && (void)

^ permalink raw reply

* Re: md-cluster Oops 4.9.13
From: Marc Smith @ 2017-04-10 13:25 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid
In-Reply-To: <58E45E0C.6030705@gmail.com>

Hi,

Sorry for the delay... I was hoping to cherry-pick this and test
against 4.9.x, but it didn't apply cleanly, although it looks trivial
to do it by hand. Is it recommended/okay to test this patch against
4.9.x? Will the fix eventually be merged into 4.9.x?


--Marc

On Tue, Apr 4, 2017 at 11:01 PM, Guoqing Jiang <jgq516@gmail.com> wrote:
>
>
> On 04/04/2017 10:06 PM, Marc Smith wrote:
>>
>> Hi,
>>
>> I encountered an oops this morning when stopping a MD array
>> (md-cluster)... there were 4 md-cluster array started, and they were
>> in the middle of a rebuild. I stopped the first one and then stopped
>> the second one immediately after and got the oops, here is a
>> transcript of what was on my terminal session:
>>
>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array1
>> mdadm: stopped /dev/md/array1
>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array2
>>
>> Message from syslogd@brimstone-1b at Tue Apr  4 09:54:40 2017 ...
>> brimstone-1b kernel: [649162.174685] BUG: unable to handle kernel NULL
>> pointer dereference at 0000000000000098
>>
>> Using Linux 4.9.13 and here is the output from the kernel messages:
>>
>> --snip--
>> [649158.014731] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: leaving the
>> lockspace group...
>> [649158.015233] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: group event
>> done 0 0
>> [649158.015303] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8:
>> release_lockspace final free
>> [649158.015331] md: unbind<nvme0n1p1>
>> [649158.042540] md: export_rdev(nvme0n1p1)
>> [649158.042546] md: unbind<nvme1n1p1>
>> [649158.048501] md: export_rdev(nvme1n1p1)
>> [649161.759022] md127: detected capacity change from 1000068874240 to 0
>> [649161.759025] md: md127 stopped.
>> [649162.174685] BUG: unable to handle kernel NULL pointer dereference
>> at 0000000000000098
>> [649162.174727] IP: [<ffffffff81868b40>] recv_daemon+0x1e9/0x373
>
>
> Looks like the recv_daemon is still running after stop array, commit
> 48df498 "md: move bitmap_destroy to the beginning of __md_stop"
> ensure it won't happen.
>
>
> [snip]
>
>> Perhaps this is already fixed in later versions? Let me know if you
>> need any additional information.
>
>
> Could you pls try with the latest version? Please let me know if you
> still see it, thanks.
>
> Regards,
> Guoqing
>

^ permalink raw reply

* Re: [RFC PATCH v5] IV Generation algorithms for dm-crypt
From: Milan Broz @ 2017-04-10 14:00 UTC (permalink / raw)
  To: Binoy Jayan, Oded, Ofir
  Cc: Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, linux-kernel, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Milan Broz, Gilad
In-Reply-To: <1491562064-23591-1-git-send-email-binoy.jayan@linaro.org>

On 04/07/2017 12:47 PM, Binoy Jayan wrote:
> ===============================================================================
> dm-crypt optimization for larger block sizes
> ===============================================================================
...
> Tests with dd [direct i/o]
> 
> Sequential read            -0.134 %
> Sequential Write           +0.091 %
> 
> Tests with fio [Aggregate bandwidth - aggrb]
> 
> Random Read                +0.358 %
> Random Write               +0.010 %
> 
> Tests with bonnie++ [768 MiB File, 384 MiB Ram]
> after mounting dm-crypt target as ext4
> 
> Sequential o/p [per-char]  -2.876 %
> Sequential o/p [per-blk]   +0.992 %
> Sequential o/p [re-write]  +4.465 %
> 
> Sequential i/p [per-char] -0.453 %
> Sequential i/p [per-blk]  -0.740 %
> 
> Sequential create         -0.255 %
> Sequential delete         +0.042 %
> Random create             -0.007 %
> Random delete             +0.454 %
> 
> NB: The '+' sign shows improvement and '-' shows degradation.
>     The tests were performed with minimal cpu load.
>     Tests with higher cpu load to be done

Well, it is good that there is no performance degradation but it
would be nice to have some user of it that proves it is really
working for your hw.

FYI - with patch that increases dmcrypt sector size to 4k
I can see improvement in speed usually in 5-15% with sync AES-NI
(depends on access pattern), with dmcrypt mapped to memory
it is even close to 20% speed up (but such a configuration is
completely artificial).

I wonder why increased dmcrypt sector size does not work for your hw,
it should help as well (and can be combiuned later with this IV approach).
(For native 4k drives this should be used in future anyway...)

Milan

^ permalink raw reply

* Re: [PATCH 2/7] Makefile, x86, LLVM: disable unsupported optimization flags
From: Masahiro Yamada @ 2017-04-10 14:54 UTC (permalink / raw)
  To: Michael Davidson
  Cc: Matthias Kaehlcke, Michal Marek, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Herbert Xu, David S. Miller, Shaohua Li,
	Alexander Potapenko, Dmitry Vyukov, X86 ML,
	Linux Kbuild mailing list, Linux Kernel Mailing List,
	linux-crypto, linux-raid, Arnd Bergmann
In-Reply-To: <CA+=D-XVbZfgaQ+ZrjctAF5Ek3kaF3XqTLsSu==hK9savDztKsw@mail.gmail.com>

Hi.


2017-04-06 4:11 GMT+09:00 Michael Davidson <md@google.com>:
> It "works" for the cases that I currently care about but I have to say
> that I am uneasy about adding -Werror to the cc-option test in this
> way.
>
> Suppose that one of the *other* flags that is implicitly passed to the
> compiler by cc-option - eg something that was explicitly specified in
> $(KBUILD_CFLAGS) - triggers a warning. In that case all calls to
> cc-option will silently fail because of the -Werror and valid options
> will not be detected correctly.

Theoretically, options explicitly specified in KBUILD_CFLAGS
should be always valid.
Options that may not be supported in some cases
should be wrapped with $(call cc-option ).



> If everyone is OK with that because "it shouldn't normally ever
> happen" then that is fine, but if does result in a subtle change from
> existing behavior (and a trap that I almost immediately fell into
> after applying a similar patch).

There is a rare case where a particular combination fails
(such as the conflict between -pg and -ffunction-sections
as reported in https://patchwork.kernel.org/patch/9624573/).

In a such case, we may end up with swapping the order,
but this should not happen quite often.



> On Wed, Apr 5, 2017 at 12:01 PM, Matthias Kaehlcke <mka@chromium.org> wrote:
>> Hi Masahiro,
>>
>> El Thu, Apr 06, 2017 at 03:08:26AM +0900 Masahiro Yamada ha dit:
>>
>>> 2017-03-17 9:15 GMT+09:00 Michael Davidson <md@google.com>:
>>> > Unfortunately, while clang generates a warning about these flags
>>> > being unsupported it still exits with a status of 0 so we have
>>> > to explicitly disable them instead of just using a cc-option check.
>>> >
>>> > Signed-off-by: Michael Davidson <md@google.com>
>>>
>>>
>>> Instead, does the following work for you?
>>> https://patchwork.kernel.org/patch/9657285/
>>
>> Thanks for the pointer, I was about to give this change (or rather its
>> ancestor) a rework myself :)
>>
>>> You need to use
>>> $(call cc-option, ...)
>>> for -falign-jumps=1 and -falign-loops=1
>>
>> I can confirm that this works.
>>
>> Thanks
>>
>> Matthias
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH v2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Shaohua Li @ 2017-04-10 16:21 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
	jes.sorensen
In-Reply-To: <20170329080013.1445439-1-songliubraving@fb.com>

On Wed, Mar 29, 2017 at 01:00:13AM -0700, Song Liu wrote:
> For the raid456 with writeback cache, when journal device failed during
> normal operation, it is still possible to persist all data, as all
> pending data is still in stripe cache. However, it is necessary to handle
> journal failure gracefully.
> 
> During journal failures, this patch makes the follow changes to land data
> in cache to raid disks gracefully:
> 
> 1. In raid5_remove_disk(), flush all cached stripes;
> 2. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>    to make progress;
> 3. In delay_towrite(), only process data in the cache (skip dev->towrite);
> 4. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
>    in loprio_list
> 5. In r5l_do_submit_io(), submit io->split_bio first (see inline comments
>    for details). 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 27 ++++++++++++++++++---------
>  drivers/md/raid5.c       | 28 ++++++++++++++++++++++++----
>  2 files changed, 42 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index b6194e0..0838617 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -632,20 +632,29 @@ static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
>  	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
>  	spin_unlock_irqrestore(&log->io_list_lock, flags);
>  
> +	/*
> +	 * In case of journal device failures, submit_bio will get error
> +	 * and calls endio, then active stripes will continue write
> +	 * process. Therefore, it is not necessary to check Faulty bit
> +	 * of journal device here.
> +	 *
> +	 * However, calling r5l_log_endio(current_bio) may change
> +	 * split_bio. Therefore, it is necessary to check split_bio before
> +	 * submit current_bio.
> +	 */

sorry, for the delay. what did you mean 'calling r5l_log_endio may change
split_bio'? The split_bio is chained into current_bio. The endio of
current_bio(r5l_log_endio) is only called after all chained bio completion. I
didn't get the point why this change.

> +	if (io->split_bio) {
> +		if (io->has_flush)
> +			io->split_bio->bi_opf |= REQ_PREFLUSH;
> +		if (io->has_fua)
> +			io->split_bio->bi_opf |= REQ_FUA;
> +		submit_bio(io->split_bio);
> +	}
> +
>  	if (io->has_flush)
>  		io->current_bio->bi_opf |= REQ_PREFLUSH;
>  	if (io->has_fua)
>  		io->current_bio->bi_opf |= REQ_FUA;
>  	submit_bio(io->current_bio);
> -
> -	if (!io->split_bio)
> -		return;
> -
> -	if (io->has_flush)
> -		io->split_bio->bi_opf |= REQ_PREFLUSH;
> -	if (io->has_fua)
> -		io->split_bio->bi_opf |= REQ_FUA;
> -	submit_bio(io->split_bio);
>  }
>  
>  /* deferred io_unit will be dispatched here */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6036d5e..4d3d1ab 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3054,6 +3054,11 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
>   *      When LOG_CRITICAL, stripes with injournal == 0 will be sent to
>   *      no_space_stripes list.
>   *
> + *   3. during journal failure
> + *      In journal failure, we try to flush all cached data to raid disks
> + *      based on data in stripe cache. The array is read-only to upper
> + *      layers, so we would skip all pending writes.
> + *
>   */
>  static inline bool delay_towrite(struct r5conf *conf,
>  				 struct r5dev *dev,
> @@ -3067,6 +3072,9 @@ static inline bool delay_towrite(struct r5conf *conf,
>  	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
>  	    s->injournal > 0)
>  		return true;
> +	/* case 3 above */
> +	if (s->log_failed && s->injournal)
> +		return true;
>  	return false;
>  }
>  
> @@ -4689,10 +4697,15 @@ static void handle_stripe(struct stripe_head *sh)
>  	       " to_write=%d failed=%d failed_num=%d,%d\n",
>  	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
>  	       s.failed_num[0], s.failed_num[1]);
> -	/* check if the array has lost more than max_degraded devices and,
> +	/*
> +	 * check if the array has lost more than max_degraded devices and,
>  	 * if so, some requests might need to be failed.
> +	 *
> +	 * When journal device failed (log_failed), we will only process
> +	 * the stripe if there is data need write to raid disks
>  	 */
> -	if (s.failed > conf->max_degraded || s.log_failed) {
> +	if (s.failed > conf->max_degraded ||
> +	    (s.log_failed && s.injournal == 0)) {
>  		sh->check_state = 0;
>  		sh->reconstruct_state = 0;
>  		break_stripe_batch_list(sh, 0);
> @@ -5272,7 +5285,8 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
>  	struct list_head *handle_list = NULL;
>  	struct r5worker_group *wg;
>  	bool second_try = !r5c_is_writeback(conf->log);
> -	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state) ||
> +		r5l_log_disk_error(conf);
>  
>  again:
>  	wg = NULL;
> @@ -7526,6 +7540,7 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  	int number = rdev->raid_disk;
>  	struct md_rdev **rdevp;
>  	struct disk_info *p = conf->disks + number;
> +	unsigned long flags;
>  
>  	print_raid5_conf(conf);
>  	if (test_bit(Journal, &rdev->flags) && conf->log) {
> @@ -7535,7 +7550,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  		 * neilb: there is no locking about new writes here,
>  		 * so this cannot be safe.
>  		 */
> -		if (atomic_read(&conf->active_stripes)) {
> +		if (atomic_read(&conf->active_stripes) ||
> +		    atomic_read(&conf->r5c_cached_full_stripes) ||
> +		    atomic_read(&conf->r5c_cached_partial_stripes)) {
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			r5c_flush_cache(conf, INT_MAX);
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
>  			return -EBUSY;

It's weird this is called in raid5_remove_disk, shouldn't this be called in log
disk error in case user doesn't remove the log disk? And this is a policy
change. User might not want to do the flush, as this exposes write hole. I
think at least we should print info out here to warn user the flush.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] md: update slab_cache before releasing new stripes when stripes resizing
From: Shaohua Li @ 2017-04-10 16:25 UTC (permalink / raw)
  To: Dennis Yang; +Cc: linux-raid
In-Reply-To: <1490773573-32692-1-git-send-email-dennisyang@qnap.com>

On Wed, Mar 29, 2017 at 03:46:13PM +0800, Dennis Yang wrote:
> When growing raid5 device on machine with small memory, there is chance that
> mdadm will be killed and the following bug report can be observed. The same
> bug could also be reproduced in linux-4.10.6.
> 
> [57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
> [57600.114678] Oops: 0002 [#1] SMP
> [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
> [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
> [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
> [57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
> [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
> [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
> [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
> [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
> [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
> [57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
> [57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
> [57600.274942] Stack:
> [57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
> [57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
> [57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
> [57600.299254] Call Trace:
> [57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
> [57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
> [57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
> [57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
> [57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
> [57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
> [57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
> [57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
> [57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
> [57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
> [57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
> [57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
> [57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
> [57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
> [57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
> [57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
> [57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
> [57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
> [57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
> [57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
> [57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.441208]  RSP <ffff880043073810>
> [57600.444690] CR2: 0000000000000000
> [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
> 
> The problem is that resize_stripes() releases new stripe_heads before assigning new
> slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
> after resize_stripes() starting releasing new stripes but right before new slab cache
> being assigned, it is possible that these new stripe_heads will be freed with the old
> slab_cache which was already been destoryed and that triggers this bug.

applied, thanks!

 
> Signed-off-by: Dennis Yang <dennisyang@qnap.com>
> ---
>  drivers/md/raid5.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6661db2c..172edc1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2286,6 +2286,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
>  		err = -ENOMEM;
>  
>  	mutex_unlock(&conf->cache_size_mutex);
> +	
> +	conf->slab_cache = sc;
> +	conf->active_name = 1-conf->active_name;
> +
>  	/* Step 4, return new stripes to service */
>  	while(!list_empty(&newstripes)) {
>  		nsh = list_entry(newstripes.next, struct stripe_head, lru);
> @@ -2303,8 +2307,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
>  	}
>  	/* critical section pass, GFP_NOIO no longer needed */
>  
> -	conf->slab_cache = sc;
> -	conf->active_name = 1-conf->active_name;
>  	if (!err)
>  		conf->pool_size = newsize;
>  	return err;
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md/raid6: Fix anomily when recovering a single device in RAID6.
From: Shaohua Li @ 2017-04-10 17:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Brad Campbell, Linux-RAID, Dan Williams
In-Reply-To: <87r31adyuj.fsf@notabene.neil.brown.name>

On Mon, Apr 03, 2017 at 12:11:32PM +1000, Neil Brown wrote:
> 
> When recoverying a single missing/failed device in a RAID6,
> those stripes where the Q block is on the missing device are
> handled a bit differently.  In these cases it is easy to
> check that the P block is correct, so we do.  This results
> in the P block be destroy.  Consequently the P block needs
> to be read a second time in order to compute Q.  This causes
> lots of seeks and hurts performance.
> 
> It shouldn't be necessary to re-read P as it can be computed
> from the DATA.  But we only compute blocks on missing
> devices, since c337869d9501 ("md: do not compute parity
> unless it is on a failed drive").
> 
> So relax the change made in that commit to allow computing
> of the P block in a RAID6 which it is the only missing that
> block.
> 
> This makes RAID6 recovery run much faster as the disk just
> "before" the recovering device is no longer seeking
> back-and-forth.
> 
> Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: NeilBrown <neilb@suse.com>

Applied, thanks, very interesting analysis!

> ---
>  drivers/md/raid5.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c523fd69a7bc..aeb2e236a247 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3617,9 +3617,20 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>  		BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
>  		BUG_ON(test_bit(R5_Wantread, &dev->flags));
>  		BUG_ON(sh->batch_head);
> +
> +		/*
> +		 * In the raid6 case if the only non-uptodate disk is P
> +		 * then we already trusted P to compute the other failed
> +		 * drives. It is safe to compute rather than re-read P.
> +		 * In other cases we only compute blocks from failed
> +		 * devices, otherwise check/repair might fail to detect
> +		 * a real inconsistency.
> +		 */
> +
>  		if ((s->uptodate == disks - 1) &&
> +		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
>  		    (s->failed && (disk_idx == s->failed_num[0] ||
> -				   disk_idx == s->failed_num[1]))) {
> +				   disk_idx == s->failed_num[1])))) {
>  			/* have disk failed, and we're requested to fetch it;
>  			 * do compute it
>  			 */
> -- 
> 2.12.0
> 



^ permalink raw reply

* Re: [PATCH V2] md/raid10: reset the 'first' at the end of loop
From: Shaohua Li @ 2017-04-10 17:41 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb
In-Reply-To: <1491441138-16155-1-git-send-email-gqjiang@suse.com>

On Thu, Apr 06, 2017 at 09:12:18AM +0800, Guoqing Jiang wrote:
> We need to set "first = 0' at the end of rdev_for_each
> loop, so we can get the array's min_offset_diff correctly
> otherwise min_offset_diff just means the last rdev's
> offset diff.
> 
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

applied, thanks!

> ---
>  drivers/md/raid10.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0f13d57..e055ec9 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3769,6 +3769,7 @@ static int raid10_run(struct mddev *mddev)
>  
>  		if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
>  			discard_supported = true;
> +		first = 0;
>  	}
>  
>  	if (mddev->queue) {
> @@ -4172,6 +4173,7 @@ static int raid10_start_reshape(struct mddev *mddev)
>  			if (first || diff < min_offset_diff)
>  				min_offset_diff = diff;
>  		}
> +		first = 0;
>  	}
>  
>  	if (max(before_length, after_length) > min_offset_diff)
> -- 
> 2.6.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md:MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop
From: Shaohua Li @ 2017-04-10 17:47 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: shli, linux-raid, NeilBrown
In-Reply-To: <1491448593-12938-1-git-send-email-zlliu@suse.com>

On Thu, Apr 06, 2017 at 11:16:33AM +0800, Zhilong Liu wrote:
> From: NeilBrown <neilb@suse.com>
> 
> if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
> be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
> needs to be cleared in md_ioctl after any call to md_set_readonly()
> or do_md_stop().
> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>

thanks, applied! This one looks stable stuff too.

> ---
> 
>  drivers/md/md.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..906a4bf 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6776,6 +6776,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>  	void __user *argp = (void __user *)arg;
>  	struct mddev *mddev = NULL;
>  	int ro;
> +	bool did_set_md_closing = false;
>  
>  	if (!md_ioctl_valid(cmd))
>  		return -ENOTTY;
> @@ -6865,7 +6866,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>  			err = -EBUSY;
>  			goto out;
>  		}
> +		WARN_ON_ONCE(test_bit(MD_CLOSING, &mddev->flags));
>  		set_bit(MD_CLOSING, &mddev->flags);
> +		did_set_md_closing = true;
>  		mutex_unlock(&mddev->open_mutex);
>  		sync_blockdev(bdev);
>  	}
> @@ -7058,6 +7061,8 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
>  		mddev->hold_active = 0;
>  	mddev_unlock(mddev);
>  out:
> +	if(did_set_md_closing)
> +		clear_bit(MD_CLOSING, &mddev->flags);
>  	return err;
>  }
>  #ifdef CONFIG_COMPAT
> -- 
> 2.6.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md.c:didn't unlock the mddev before return EINVAL in array_size_store
From: Shaohua Li @ 2017-04-10 17:49 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: shli, linux-raid
In-Reply-To: <1491804955-7548-1-git-send-email-zlliu@suse.com>

On Mon, Apr 10, 2017 at 02:15:55PM +0800, Zhilong Liu wrote:
> md.c: it needs to release the mddev lock before
> the array_size_store() returns.

applied, thanks!
 
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  drivers/md/md.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6ae1d6..5327236 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4843,8 +4843,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
>  		return err;
>  
>  	/* cluster raid doesn't support change array_sectors */
> -	if (mddev_is_clustered(mddev))
> +	if (mddev_is_clustered(mddev)) {
> +		mddev_unlock(mddev);
>  		return -EINVAL;
> +	}
>  
>  	if (strncmp(buf, "default", 7) == 0) {
>  		if (mddev->pers)
> -- 
> 2.6.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md/raid1: avoid reusing a resync bio after error handling.
From: Shaohua Li @ 2017-04-10 18:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID, Michael Wang
In-Reply-To: <87bmsaguhe.fsf@notabene.neil.brown.name>

On Thu, Apr 06, 2017 at 12:06:37PM +1000, Neil Brown wrote:
> 
> fix_sync_read_error() modifies a bio on a newly faulty
> device by setting bi_end_io to end_sync_write.
> This ensure that put_buf() will still call rdev_dec_pending()
> as required, but makes sure that subsequent code in
> fix_sync_read_error() doesn't try to read from the device.
> 
> Unfortunately this interacts badly with sync_request_write()
> which assumes that any bio with bi_end_io set to non-NULL
> other than end_sync_read is safe to write to.
> 
> As the device is now faulty it doesn't make sense to write.
> As the bio was recently used for a read, it is "dirty"
> and not suitable for immediate submission.
> In particular, ->bi_next might be non-NULL, which will cause
> generic_make_request() to complain.
> 
> Break this interaction by refusing to write to devices
> which are marked as Faulty.
> 
> Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com>
> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
> Cc: stable@vger.kernel.org (v4.10+)
> Signed-off-by: NeilBrown <neilb@suse.com>

Thanks, applied!

> ---
>  drivers/md/raid1.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a70283753a35..9c1b2231d2db 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2185,6 +2185,8 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
>  		     (i == r1_bio->read_disk ||
>  		      !test_bit(MD_RECOVERY_SYNC, &mddev->recovery))))
>  			continue;
> +		if (test_bit(Faulty, &conf->mirrors[i].rdev->flags))
> +			continue;
>  
>  		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
> -- 
> 2.12.2
> 



^ permalink raw reply

* Re: [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Shaohua Li @ 2017-04-10 19:09 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, songliubraving
In-Reply-To: <20170404111358.14829-2-artur.paszkiewicz@intel.com>


On Tue, Apr 04, 2017 at 01:13:55PM +0200, Artur Paszkiewicz wrote:

Cc: Song, the raid5-cache needs similar fix.

> Allocate both struct ppl_io_unit and its header_page from a shared
> mempool to avoid a possible deadlock. Implement allocate and free
> functions for the mempool, remove the second pool for allocating
> header_page. The header_pages are now freed with their io_units, not
> when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
> for allocating ppl_io_unit because we can handle failed allocations and
> there is no reason to utilize emergency reserves.

I applied the last 3 patches, had some nitpicks for this one though

> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 53 ++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 36 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 86ea9addb51a..42e43467d1e8 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -102,7 +102,6 @@ struct ppl_conf {
>  	struct kmem_cache *io_kc;
>  	mempool_t *io_pool;
>  	struct bio_set *bs;
> -	mempool_t *meta_pool;
>  
>  	/* used only for recovery */
>  	int recovered_entries;
> @@ -195,6 +194,33 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>  	return tx;
>  }
>  
> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
> +{
> +	struct kmem_cache *kc = pool_data;
> +	struct ppl_io_unit *io;
> +
> +	io = kmem_cache_alloc(kc, gfp_mask);
> +	if (!io)
> +		return NULL;
> +
> +	io->header_page = alloc_page(gfp_mask);
> +	if (!io->header_page) {
> +		kmem_cache_free(kc, io);
> +		return NULL;
> +	}
> +
> +	return io;

Maybe directly use GFP_NOWAIT here, we don't use other gfp

> +}
> +
> +static void ppl_io_pool_free(void *element, void *pool_data)
> +{
> +	struct kmem_cache *kc = pool_data;
> +	struct ppl_io_unit *io = element;
> +
> +	__free_page(io->header_page);
> +	kmem_cache_free(kc, io);
> +}
> +
>  static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>  					  struct stripe_head *sh)
>  {
> @@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>  	struct ppl_io_unit *io;
>  	struct ppl_header *pplhdr;
>  
> -	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> +	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
>  	if (!io)
>  		return NULL;
>  
> -	memset(io, 0, sizeof(*io));
>  	io->log = log;
> +	io->entries_count = 0;
> +	io->pp_size = 0;
> +	io->submitted = false;

I'd suggest moving the memset to ppl_io_pool_alloc. Don't think we need to
optimize to avoid setting header_page. And doing memset is less error prone,
for example adding new fields.

Otherwise looks quite good.

Thanks,
Shaohua
>  	INIT_LIST_HEAD(&io->log_sibling);
>  	INIT_LIST_HEAD(&io->stripe_list);
>  	atomic_set(&io->pending_stripes, 0);
>  	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
>  
> -	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>  	pplhdr = page_address(io->header_page);
>  	clear_page(pplhdr);
>  	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> @@ -369,8 +396,6 @@ static void ppl_log_endio(struct bio *bio)
>  	if (bio->bi_error)
>  		md_error(ppl_conf->mddev, log->rdev);
>  
> -	mempool_free(io->header_page, ppl_conf->meta_pool);
> -
>  	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>  		list_del_init(&sh->log_list);
>  
> @@ -998,7 +1023,6 @@ static void __ppl_exit_log(struct ppl_conf *ppl_conf)
>  
>  	kfree(ppl_conf->child_logs);
>  
> -	mempool_destroy(ppl_conf->meta_pool);
>  	if (ppl_conf->bs)
>  		bioset_free(ppl_conf->bs);
>  	mempool_destroy(ppl_conf->io_pool);
> @@ -1104,25 +1128,20 @@ int ppl_init_log(struct r5conf *conf)
>  
>  	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
>  	if (!ppl_conf->io_kc) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
> -	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
> +	ppl_conf->io_pool = mempool_create(conf->raid_disks, ppl_io_pool_alloc,
> +					   ppl_io_pool_free, ppl_conf->io_kc);
>  	if (!ppl_conf->io_pool) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
>  	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
>  	if (!ppl_conf->bs) {
> -		ret = -EINVAL;
> -		goto err;
> -	}
> -
> -	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
> -	if (!ppl_conf->meta_pool) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
> -- 
> 2.11.0
> 

^ permalink raw reply

* Re: [PATCH 03/27] block: implement splitting of REQ_OP_WRITE_ZEROES bios
From: Anthony Youngman @ 2017-04-10 19:40 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170405142205.6477-4-hch@lst.de>

s/past/paste/

On 05/04/17 15:21, Christoph Hellwig wrote:
> Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
> that limit the write zeroes size.

Cheers,
Wol

^ permalink raw reply

* Re: 4.10 + 765d704db: no improvemtn in write rates with md/raid5 group_thread_cnt > 0
From: Shaohua Li @ 2017-04-10 20:10 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid, Shaohua Li
In-Reply-To: <87zifvj61v.fsf@esperi.org.uk>

On Wed, Apr 05, 2017 at 03:13:48PM +0100, Nix wrote:
> So you'd expect write rates on a RAID-5 array to be higher than write rates on a
> single spinning-rust disk, right? Because, even with Shaohua's commit
> 765d704db1f583630d52 applied atop 4.10, I see little sign of it. Does this
> commit depend upon something else to stop death by seeking with
> group_thread_cnt > 0? It didn't look like it to me...
> 
> The results Shaohua showed in the original commit were very impressive, but for
> the life of me I can't figure out how to get anything like them.

That only works well with large iodepth. For single write, we are still far
from the BW in theory. I actually wrote in the commit log:

"We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline."

Thanks,
Shaohua
 
> 
> With group_thread_cnt 0, I max out at a bit higher than the 240MiB/s one disk in
> this array can manage on its own, for obvious reasons: md_raid5 CPU saturation.
> (This is with a 512KiB chunksize, stripe_cache_size of 512: yes, I know that's
> small, it's just a random slice taken out of a much larger test series: the
> array is a smallish non-degraded unjournalled four-element md5 initialized with
> --assume-clean for benchmarking). Similar results are seen with ext4 and xfs.
> Trimmed-down iozone -a output, so only one serial writer, but still:
> 
>                                                        stride
>      kB  reclen    write  rewrite    read    reread      read
>      64       4     6752    15647    26489    30145     26678
>      64       8     6639    25236    45101    56289     43158
>      64      16     6014     9799    67364    89009     60900
>      64      32    35200    48781     7374   177207      7336
>      64      64    32420    70551   109395   229470     97868
> [...]
>   32768      64    28181    30576   265403   178438    299889
>   32768     128    41659    39989   319709   320689    330949
>   32768     256    45402    44555   320689   357564    451256
>   32768     512    42559    40556   177862   299744    466529
>   32768    1024    68005    52814   415747   391507    706177
>   32768    2048    91701   103918   520689   540128   1061339
>   32768    4096   177716   169486   487277   514111    683463
>   32768    8192   218923   233152   539853   616869    453021
>   32768   16384   199068   198872   569353   619913    535240
> [...]
>  262144      64    25148    32423   385802   378681     27762
>  262144     128    42510    41626   436994   380669     48004
>  262144     256    43415    44004   436209   418971     76697
>  262144     512    41408    40399   342862   401145    116781
>  262144    1024    68870    59341   465737   507454    265154
>  262144    2048   101994    91693   589277   582836    296474
>  262144    4096   176852   166200   581922   649215    421253
>  262144    8192   226696   221838   601174   633347    569766
>  262144   16384   307843   297985   644679   659060    569302
>  524288      64    25155    24527   392401   401908     21461
>  524288     128    41422    41525   433156   464331     35360
>  524288     256    42059    43742   443281   415799     70171
>  524288     512    41253    39360   414306   428993     75387
>  524288    1024    66081    61151   498880   517952    186959
>  524288    2048   101272    90418   610467   623258    274331
>  524288    4096   171489   173381   601689   576333    314290
>  524288    8192   220943   215226   641713   607459    444827
>  524288   16384   289055   296340   651010   671623    503633
> 
> Read rates are as high as I'd expect for a four-disk RAID-5 array, and the
> sequential write output rates, while higher than one one disk can manage, are
> thresholded here by the performance of the md I/O thread, as expected.
> 
> If I boost group_thread_cnt to, say, 2, I see:
> 
>      64       4     3677    14565    27936    36056     29629
>      64       8     6670    21608    53422    69187     32045
>      64      16     6682    26209    70329   103891     66662
>      64      32    28624    40048     7312   154556      7345
>      64      64    38327    43213    89127   260160     90540
> [...]
>   32768      64    14328    18580   265136   282946    308082
>   32768     128    26310    24803   265762   323414    354685
>   32768     256    29115    27073   238659   308974    345723
>   32768     512    21572    21345   293312   314086    345365
>   32768    1024    43978    38071   395715   345161    545821
>   32768    2048    82898    70840   293151   470398    922082
>   32768    4096   143350   124658   391980   659819    617984
>   32768    8192   164297   227661   570423   645141    515009
>   32768   16384   157701   171804   568484   451448    350715
> [...]
>  262144      64    17150    17693   391561   382751     28374
>  262144     128    25385    26498   423685   410359     47148
>  262144     256    29219    30244   392992   421748     80403
>  262144     512    24303    24686   399453   371882    122861
>  262144    1024    42296    42535   403020   508195    261339
>  262144    2048    75740    63125   606979   589329    296124
>  262144    4096   134646   137543   562749   590893    392938
>  262144    8192   237800   239847   631752   620766    475791
>  262144   16384   267889   304517   635674   612164    598521
>  524288      64    17691    17776   403333   374628     21673
>  524288     128    25575    25609   396568   439018     34526
>  524288     256    29984    29990   412587   437099     71650
>  524288     512    24971    25599   403074   431581     75471
>  524288    1024    42545    43657   505740   519112    200811
>  524288    2048    72519    75604   559987   589069    257654
>  524288    4096   135122   140745   622450   499336    331273
>  524288    8192   232848   231307   592729   604849    432296
>  524288   16384   280105   271252   647725   664868    472363
> 
> Larger writes are clearly still thresholded.
> 
> Boost the thread count more, here, to 8:
> 
>      64       4     7834    14388    30346    40300      6124
>      64       8    17236    21282     6984    37644      6842
>      64      16    21100    24720     7208   120277      7199
>      64      32    29411    45553     7374   162411      7357
>      64      64     3671    59588    78128   256923     82804
> [...]
>   32768      64    14261    17866   261303   289135    294245
>   32768     128    25832    27639   298172   324766    342822
>   32768     256    26477    27196   277318   339353    352967
>   32768     512    17848    19875   339424   272225    387746
>   32768    1024    36017    38945   482068   464194    110825
>   32768    2048    64240    67976   551762   505772     76629
>   32768    4096    71022   117680   578561   696507    752493
>   32768    8192   161080   207790   564343   556796    546488
>   32768   16384   172937   233103   521368   603562    418679
> [...]
>  262144      64    17170    17452   352337   351258     27824
>  262144     128    25318    25522   418977   424859     47112
>  262144     256    26405    27092   426170   419684     79047
>  262144     512    20185    20271   398733   411974    135554
>  262144    1024    39013    38238   497919   438150    180384
>  262144    2048    71054    70921   586634   535676    258955
>  262144    4096   113222   121554   616548   604177    293088
>  262144    8192   184086   187845   551395   586126    496147
>  262144   16384   286319   272419   645900   659103    589384
>  524288      64    16980    16756   385746   381476     21462
>  524288     128    24993    25482   428855   438250     34889
>  524288     256    26517    26134   448088   395352     70225
>  524288     512    19534    19484   418764   416630     76975
>  524288    1024    37645    38370   514030   511638    177818
>  524288    2048    68469    72200   602688   542627    251162
>  524288    4096   115467   121220   598738   629120    289589
>  524288    8192   185093   182044   621233   586919    437162
>  524288   16384   250990   266257   620428   660663    494770
> 
> Still thresholded. Yes, this is only one serial writer, but nonetheless this
> seems a bit od.
> 
> -- 
> NULL && (void)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Shrinking an array
From: Wakko Warner @ 2017-04-11  0:30 UTC (permalink / raw)
  To: linux-raid

I have a question about shrinking an array.  My current array is 4x 2tb
disks in raid6 (md0).  The array was created on the 2nd partition of each
disk and spans most of the disk.  I would like to replace the 2tb disks with
750gb disks.  md0 is a luks container with lvm underneath.  I have less than
1tb actually in use.  What would the recommended procedure be for shrinking
this?  I've watched this list, but I don't think I've come across anyone
actually wanting to do this before.
I'm thinking of these steps already:
1) Shrink PV.
2) Shrink luks.  I'm aware that there is not size metadata, but the dm
mapping would need to be shrunk.
3) Shrink md0.  I did this once when I changed a 6 drive raid6 into a 5
drive raid6.  Would I use --array-size= or --size= ?  I understand the
difference is the size of md0 vs the individual members.

So for number 4, if md0 is now small enough, will it accept a member that is
smaller?  If so, I should beable to add the member to the array and issue
--replace.

Thanks.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox