Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* [mdadm PATCH] Retry HOT_REMOVE_DISK a few times.
From: NeilBrown @ 2017-03-27  1:50 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 3612 bytes --]


HOT_REMOVE_DISK can fail with EBUSY if there are outstanding
IO request that have not completed yet.  It can sometimes
be helpful to wait a little while for these to complete.

We already do this in impose_level() when reshaping a device,
but not in Manage.c in response to an explicit --remove request.

So create hot_remove_disk() to central this code, and call it
where-ever it makes sense to wait for a HOT_REMOVE_DISK to succeed.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 Grow.c   |  9 +--------
 Manage.c |  4 ++--
 mdadm.h  |  1 +
 util.c   | 18 ++++++++++++++++++
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/Grow.c b/Grow.c
index 455c5f90bf58..218a70649c1f 100755
--- a/Grow.c
+++ b/Grow.c
@@ -2736,7 +2736,6 @@ static int impose_level(int fd, int level, char *devname, int verbose)
 		for (d = 0, found = 0;
 		     d < MAX_DISKS && found < array.nr_disks;
 		     d++) {
-			int cnt;
 			mdu_disk_info_t disk;
 			disk.number = d;
 			if (ioctl(fd, GET_DISK_INFO, &disk) < 0)
@@ -2750,13 +2749,7 @@ static int impose_level(int fd, int level, char *devname, int verbose)
 				continue;
 			ioctl(fd, SET_DISK_FAULTY,
 			      makedev(disk.major, disk.minor));
-			cnt = 5;
-			while (ioctl(fd, HOT_REMOVE_DISK,
-				     makedev(disk.major, disk.minor)) < 0
-			       && errno == EBUSY
-			       && cnt--) {
-				usleep(10000);
-			}
+			hot_remove_disk(fd, makedev(disk.major, disk.minor));
 		}
 	}
 	c = map_num(pers, level);
diff --git a/Manage.c b/Manage.c
index 5c3d2b9b1a9f..9139f96e1431 100644
--- a/Manage.c
+++ b/Manage.c
@@ -1183,7 +1183,7 @@ int Manage_remove(struct supertype *tst, int fd, struct mddev_dev *dv,
 		else
 			err = 0;
 	} else {
-		err = ioctl(fd, HOT_REMOVE_DISK, rdev);
+		err = hot_remove_disk(fd, rdev);
 		if (err && errno == ENODEV) {
 			/* Old kernels rejected this if no personality
 			 * is registered */
@@ -1607,7 +1607,7 @@ int Manage_subdevs(char *devname, int fd,
 
 			if (dv->disposition == 'F')
 				/* Need to remove first */
-				ioctl(fd, HOT_REMOVE_DISK, rdev);
+				hot_remove_disk(fd, rdev);
 			/* Make sure it isn't in use (in 2.6 or later) */
 			tfd = dev_open(dv->devname, O_RDONLY|O_EXCL);
 			if (tfd >= 0) {
diff --git a/mdadm.h b/mdadm.h
index 71b8afb9fee6..53b3b5836841 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1476,6 +1476,7 @@ extern int add_disk(int mdfd, struct supertype *st,
 		    struct mdinfo *sra, struct mdinfo *info);
 extern int remove_disk(int mdfd, struct supertype *st,
 		       struct mdinfo *sra, struct mdinfo *info);
+extern int hot_remove_disk(int mdfd, unsigned long dev);
 extern int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info);
 unsigned long long min_recovery_start(struct mdinfo *array);
 
diff --git a/util.c b/util.c
index f1009723ed04..e176a9d466cf 100644
--- a/util.c
+++ b/util.c
@@ -1795,6 +1795,24 @@ int remove_disk(int mdfd, struct supertype *st,
 	return rv;
 }
 
+int hot_remove_disk(int mdfd, unsigned long dev)
+{
+	int cnt = 5;
+	int ret;
+
+	/* HOT_REMOVE_DISK can fail with EBUSY if there are
+	 * outstanding IO requests to the device.
+	 * In this case, it can be helpful to wait a little while,
+	 * up to half a second, for that IO to flush.
+	 */
+	while ((ret = ioctl(mdfd, HOT_REMOVE_DISK, dev)) == -1 &&
+	       errno == EBUSY &&
+	       cnt-- > 0)
+		usleep(10000);
+
+	return ret;
+}
+
 int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info)
 {
 	/* Initialise kernel's knowledge of array.
-- 
2.12.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* [mdadm PATCH] udev-md-raid-assembly.rules: Skip non-ready devices
From: NeilBrown @ 2017-03-27  0:15 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Linux-RAID, Hannes Reinecke

[-- Attachment #1: Type: text/plain, Size: 1126 bytes --]


From: Hannes Reinecke <hare@suse.de>

If a device isn't fully initialized (e.g if it should be
handled by multipathing) it should not be considered for
md/RAID auto-assembly.  Doing so can cause incorrect results
such as causing multipath to fail during startup.

There is a convention that the udev environment variable
SYSTEMD_READY be set to zero for such devices.  So change
the mdadm rules to ignore devices with SYSTEMD_READY==0.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 udev-md-raid-assembly.rules | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/udev-md-raid-assembly.rules b/udev-md-raid-assembly.rules
index d0d440a6394c..8ca232a44b1f 100644
--- a/udev-md-raid-assembly.rules
+++ b/udev-md-raid-assembly.rules
@@ -7,6 +7,9 @@ ENV{ANACONDA}=="?*", GOTO="md_inc_end"
 
 SUBSYSTEM!="block", GOTO="md_inc_end"
 
+# skip non-initialized devices
+ENV{SYSTEMD_READY}=="0", GOTO="md_inc_end"
+
 # handle potential components of arrays (the ones supported by md)
 ENV{ID_FS_TYPE}=="linux_raid_member", GOTO="md_inc"
 
-- 
2.12.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH v3] block: trace completion of all bios.
From: NeilBrown @ 2017-03-26 23:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	open list:DEVICE-MAPPER (LVM), Alasdair Kergon, Mike Snitzer,
	Shaohua Li, Linux Kernel Mailing List, Martin K . Petersen
In-Reply-To: <CACVXFVNC=-_f1jYo5bYw7X6TT8-g7UThw3zzzfrMaKVarnJvSQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4825 bytes --]

On Fri, Mar 24 2017, Ming Lei wrote:

> On Fri, Mar 24, 2017 at 8:07 AM, NeilBrown <neilb@suse.com> wrote:
...
>> @@ -102,6 +102,8 @@ struct bio {
>>  #define BIO_REFFED     8       /* bio has elevated ->bi_cnt */
>>  #define BIO_THROTTLED  9       /* This bio has already been subjected to
>>                                  * throttling rules. Don't do it again. */
>> +#define BIO_TRACE_COMPLETION 10        /* bio_endio() should trace the final completion
>> +                                * of this bio. */
>
> This may not be a good idea, since the flag space is quite small(12).

which means there are 2 unused bits and I only want to use 1.  It should
fit.
I'm a bit perplexed why 4 bits a reserved for the BVEC_POOL number which
ranges from 0 to 7....

But if these bits really are a scarce resource, we should stop wasting
them.
The patch below makes bit 7 (BIO_CHAIN) available.  We could probably make
bit 8 (BIO_REFFED) available using a similar technique.
BIO_QUIET is only relevant if bi_error is non-zero and bi_error has a
limited range, so a bit in there could be used instead. In fact,
MAX_ERRNO is 4096, so bi_error could be a 'short'.  That could save 2
whole bytes ... or make 16 more flag bits available.

So I don't think you concern about a shortage of spare flag bits is valid.

Thanks,
NeilBrown

diff --git a/block/bio.c b/block/bio.c
index c1272986133e..1703786a206a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -274,7 +274,7 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 	      unsigned short max_vecs)
 {
 	memset(bio, 0, sizeof(*bio));
-	atomic_set(&bio->__bi_remaining, 1);
+	atomic_set(&bio->__bi_remaining, 0);
 	atomic_set(&bio->__bi_cnt, 1);
 
 	bio->bi_io_vec = table;
@@ -300,7 +300,7 @@ void bio_reset(struct bio *bio)
 
 	memset(bio, 0, BIO_RESET_BYTES);
 	bio->bi_flags = flags;
-	atomic_set(&bio->__bi_remaining, 1);
+	atomic_set(&bio->__bi_remaining, 0);
 }
 EXPORT_SYMBOL(bio_reset);
 
@@ -1794,20 +1794,15 @@ EXPORT_SYMBOL(bio_flush_dcache_pages);
 static inline bool bio_remaining_done(struct bio *bio)
 {
 	/*
-	 * If we're not chaining, then ->__bi_remaining is always 1 and
+	 * If we're not chaining, then ->__bi_remaining is always 0 and
 	 * we always end io on the first invocation.
 	 */
-	if (!bio_flagged(bio, BIO_CHAIN))
+	if (atomic_read(&bio->__bi_remaining) == 0)
 		return true;
 
 	BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
 
-	if (atomic_dec_and_test(&bio->__bi_remaining)) {
-		bio_clear_flag(bio, BIO_CHAIN);
-		return true;
-	}
-
-	return false;
+	return atomic_dec_and_test(&bio->__bi_remaining);
 }
 
 /**
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..d15245544111 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -664,13 +664,19 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 }
 
 /*
- * Increment chain count for the bio. Make sure the CHAIN flag update
- * is visible before the raised count.
+ * Increment chain count for the bio.
  */
 static inline void bio_inc_remaining(struct bio *bio)
 {
-	bio_set_flag(bio, BIO_CHAIN);
-	smp_mb__before_atomic();
+	/*
+	 * Calls to bio_inc_remaining() cannot race
+	 * with the final call to bio_end_io(), and
+	 * the first call cannot race with other calls,
+	 * so if __bi_remaining appears to be zero, there
+	 * can be no race which might change it.
+	 */
+	if (atomic_read(&bio->__bi_remaining) == 0)
+		atomic_set(&bio->__bi_remaining, 1);
 	atomic_inc(&bio->__bi_remaining);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db7a57ee0e58..2deea4501d14 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -46,6 +46,15 @@ struct bio {
 	unsigned int		bi_seg_front_size;
 	unsigned int		bi_seg_back_size;
 
+	/* __bi_remaining records the number of completion events
+	 * (i.e. calls to bi_end_io()) that need to happen before this
+	 * bio is truly complete.
+	 * A value of '0' means that there will only ever be one completion
+	 * event, so there will be no racing and no need for an atomic operation
+	 * to detect the last event.
+	 * Any other value represents a simple count subject to atomic_inc() and
+	 * atomic_dec_and_test().
+	 */
 	atomic_t		__bi_remaining;
 
 	bio_end_io_t		*bi_end_io;
@@ -98,7 +107,7 @@ struct bio {
 #define BIO_USER_MAPPED 4	/* contains user pages */
 #define BIO_NULL_MAPPED 5	/* contains invalid user pages */
 #define BIO_QUIET	6	/* Make BIO Quiet */
-#define BIO_CHAIN	7	/* chained bio, ->bi_remaining in effect */
+/* 7 unused */
 #define BIO_REFFED	8	/* bio has elevated ->bi_cnt */
 #define BIO_THROTTLED	9	/* This bio has already been subjected to
 				 * throttling rules. Don't do it again. */

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: constant array_state active after specific jobs
From: NeilBrown @ 2017-03-26 22:42 UTC (permalink / raw)
  To: pdi; +Cc: linux-raid, Shaohua Li
In-Reply-To: <20170324090442.3d01347b@hal9.pdi.lan>

[-- Attachment #1: Type: text/plain, Size: 3594 bytes --]

On Fri, Mar 24 2017, pdi wrote:

> On Fri, 24 Mar 2017 16:25:35 +1100
> NeilBrown <neilb@suse.com> wrote:
>
>> On Thu, Mar 23 2017, pdi wrote:
>> 
>> > Greetings all,
>> >
>> > The problem in a nutshell is that an array is clean after boot,
>> > until some specific jobs switch it to active where it remains until
>> > reboot.
>> >
>> > A similar problem was discussed, and solved, in 
>> > https://www.spinics.net/lists/raid/msg46450.html. However, AFAICT,
>> > it is not the same issue.
>> >
>> > I would be grateful for any insights as to why this happens and/or
>> > how to prevent it.
>> >
>> > The relevant info follows, please let me know if anything further
>> > might help.
>> >
>> > Many thanks in advance.
>> >
>> > - uname -a
>> >   Linux hostname 4.4.38 #1 SMP Sun Dec 11 16:03:41 CST 2016 x86_64
>> >   Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz GenuineIntel GNU/Linux
>> > - mdadm -V
>> >   mdadm - v3.3.4 - 3rd August 2015
>> > - Desktop drives without sct/erc,
>> >   with timeout mismatch correction as per
>> >   https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>> > - /dev/md9 is a raid10 array, 4 devices, far=2,
>> >   with various dirs used as samba and nfs shares
>> > - The array is in *constant* array_state active
>> > - mdadm -D /dev/md9 | grep 'State :'
>> >   State : active
>> > - cat /sys/block/md9/md/array_state
>> >   active
>> > - watch -d 'grep md9 /proc/diskstats'
>> >   remain unchanged
>> > - uptime
>> >   load average: 0.00, 0.00, 0.00
>> > - cat /sys/block/md9/md/safe_mode_delay
>> >   0.201
>> > - echo 0.1 > /sys/block/md9/md/safe_mode_delay
>> >   array_state remains active
>> > - echo clean > /sys/block/md9/md/array_state
>> >   echo: write error: Device or resource busy
>> > - reboot (with or without prior check)
>> >   array_state clean
>> > - After reboot, array remains clean until some specific
>> >   jobs put it in constant active state. Such jobs so far
>> >   identified:
>> >   - echo check > /sys/block/md9/md/sync_action
>> >   - run an rsnapshot job
>> >   - start a qemu/kvm vm
>> > - Other jobs, like text/doc editing, multimedia playback,
>> >   etc retain array_state clean  
>> 
>> This bug was introduced by
>> Commit: 20d0189b1012 ("block: Introduce new bio_split()")
>> in 3.14, and fixed by
>> Commit: 9b622e2bbcf0 ("raid10: increment write counter after bio is
>> split") in 4.8.
>> 
>> Maybe the latter patch should be sent to -stable ??
>> 
>> NeilBrown
>
> NeilBrown, thank you for your swift and concise answer.
>
> I gather you are referring to kernel version numbers. The described
> behaviour was first noticed many months ago with kernel 2.6.37.6, and
> persisted after a system upgrade and kernel 4.4.38. However, after the
> upgrade two things were corrected, the timeout mismatch, and a
> Current_Pending_Sector in one of the drives; which may, or may not,
> explain the occurrence with the older kernel.
>
> Is this constant active state in the data array something to worry about
> and try kernel >= 4.8, or shall I let be?

The only important consequence of the constant active state is that if
your machine crashes at a moment when the array would otherwise have
been idle, then a resync will be needed after reboot.  Without the
constant active state, that resync would not have been needed.

If you have a write-intent bitmap, this is not particularly relevant.

I cannot say how important it is to you to avoid a resync after a crash,
so I don't know if you should just let it be or not.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] md/raid1: skip data copy for behind io for discard request
From: Ming Lei @ 2017-03-25 16:30 UTC (permalink / raw)
  To: Shaohua Li; +Cc: open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <7dd6d110d2eeafaf2980f08e6e0b318be654f72a.1490393962.git.shli@fb.com>

On Sat, Mar 25, 2017 at 6:20 AM, Shaohua Li <shli@fb.com> wrote:
> discard request doesn't have data attached, so it's meaningless to
> allocate memory and copy from original bio for behind IO. And the copy
> is bogus because bio_copy_data_partial can't handle discard request.
>
> We don't support writesame/writezeros request so far.
>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Another way is to not do write behind for discard, but both should be fine, so:

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

> ---
>  drivers/md/raid1.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c6a671f..b7d9651 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1102,6 +1102,10 @@ static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
>         if (!behind_bio)
>                 goto fail;
>
> +       /* discard op, we don't support writezero/writesame yet */
> +       if (!bio_has_data(bio))
> +               goto skip_copy;
> +
>         while (i < vcnt && size) {
>                 struct page *page;
>                 int len = min_t(int, PAGE_SIZE, size);
> @@ -1118,7 +1122,7 @@ static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
>
>         bio_copy_data_partial(behind_bio, bio, offset,
>                               behind_bio->bi_iter.bi_size);
> -
> +skip_copy:
>         r1_bio->behind_master_bio = behind_bio;;
>         set_bit(R1BIO_BehindIO, &r1_bio->state);
>
> --
> 2.9.3
>



-- 
Ming Lei

^ permalink raw reply

* Re: [PATCH] md/raid5-cache: fix payload endianness problem in raid5-cache
From: Shaohua Li @ 2017-03-25 16:27 UTC (permalink / raw)
  To: Jason Yan; +Cc: linux-raid, miaoxie, zhaohongjiang, songliubraving
In-Reply-To: <1490406279-20453-1-git-send-email-yanaijie@huawei.com>

On Sat, Mar 25, 2017 at 09:44:39AM +0800, Jason Yan wrote:
> The payload->header.type and payload->size are little-endian, so just
> convert them to the right byte order.
Cc Song.

Applied, thanks!

 
> Signed-off-by: Jason Yan <yanaijie@huawei.com>
> ---
>  drivers/md/raid5-cache.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 25eb048..b6194e0 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2002,12 +2002,12 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
>  		payload = (void *)mb + mb_offset;
>  		payload_flush = (void *)mb + mb_offset;
>  
> -		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
> +		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
>  			if (r5l_recovery_verify_data_checksum(
>  				    log, ctx, page, log_offset,
>  				    payload->checksum[0]) < 0)
>  				goto mismatch;
> -		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
> +		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY) {
>  			if (r5l_recovery_verify_data_checksum(
>  				    log, ctx, page, log_offset,
>  				    payload->checksum[0]) < 0)
> @@ -2019,12 +2019,12 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
>  						 BLOCK_SECTORS),
>  				    payload->checksum[1]) < 0)
>  				goto mismatch;
> -		} else if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
> +		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
>  			/* nothing to do for R5LOG_PAYLOAD_FLUSH here */
>  		} else /* not R5LOG_PAYLOAD_DATA/PARITY/FLUSH */
>  			goto mismatch;
>  
> -		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
> +		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
>  			mb_offset += sizeof(struct r5l_payload_flush) +
>  				le32_to_cpu(payload_flush->size);
>  		} else {
> @@ -2091,7 +2091,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
>  		payload = (void *)mb + mb_offset;
>  		payload_flush = (void *)mb + mb_offset;
>  
> -		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
> +		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
>  			int i, count;
>  
>  			count = le32_to_cpu(payload_flush->size) / sizeof(__le64);
> @@ -2113,7 +2113,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
>  		}
>  
>  		/* DATA or PARITY payload */
> -		stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
> +		stripe_sect = (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) ?
>  			raid5_compute_sector(
>  				conf, le64_to_cpu(payload->location), 0, &dd,
>  				NULL)
> @@ -2151,7 +2151,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
>  			list_add_tail(&sh->lru, cached_stripe_list);
>  		}
>  
> -		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
> +		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
>  			if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
>  			    test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
>  				r5l_recovery_replay_one_stripe(conf, sh, ctx);
> @@ -2159,7 +2159,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
>  			}
>  			r5l_recovery_load_data(log, sh, ctx, payload,
>  					       log_offset);
> -		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY)
> +		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY)
>  			r5l_recovery_load_parity(log, sh, ctx, payload,
>  						 log_offset);
>  		else
> @@ -2361,7 +2361,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
>  				payload = (void *)mb + offset;
>  				payload->header.type = cpu_to_le16(
>  					R5LOG_PAYLOAD_DATA);
> -				payload->size = BLOCK_SECTORS;
> +				payload->size = cpu_to_le32(BLOCK_SECTORS);
>  				payload->location = cpu_to_le64(
>  					raid5_compute_blocknr(sh, i, 0));
>  				addr = kmap_atomic(dev->page);
> -- 
> 2.5.0
> 

^ permalink raw reply

* [PATCH] md/raid5-cache: fix payload endianness problem in raid5-cache
From: Jason Yan @ 2017-03-25  1:44 UTC (permalink / raw)
  To: shli, linux-raid; +Cc: miaoxie, zhaohongjiang, Jason Yan

The payload->header.type and payload->size are little-endian, so just
convert them to the right byte order.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
---
 drivers/md/raid5-cache.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 25eb048..b6194e0 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2002,12 +2002,12 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 		payload = (void *)mb + mb_offset;
 		payload_flush = (void *)mb + mb_offset;
 
-		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
 			if (r5l_recovery_verify_data_checksum(
 				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
 				goto mismatch;
-		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
+		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY) {
 			if (r5l_recovery_verify_data_checksum(
 				    log, ctx, page, log_offset,
 				    payload->checksum[0]) < 0)
@@ -2019,12 +2019,12 @@ r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
 						 BLOCK_SECTORS),
 				    payload->checksum[1]) < 0)
 				goto mismatch;
-		} else if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
 			/* nothing to do for R5LOG_PAYLOAD_FLUSH here */
 		} else /* not R5LOG_PAYLOAD_DATA/PARITY/FLUSH */
 			goto mismatch;
 
-		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
 			mb_offset += sizeof(struct r5l_payload_flush) +
 				le32_to_cpu(payload_flush->size);
 		} else {
@@ -2091,7 +2091,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 		payload = (void *)mb + mb_offset;
 		payload_flush = (void *)mb + mb_offset;
 
-		if (payload->header.type == R5LOG_PAYLOAD_FLUSH) {
+		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_FLUSH) {
 			int i, count;
 
 			count = le32_to_cpu(payload_flush->size) / sizeof(__le64);
@@ -2113,7 +2113,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 		}
 
 		/* DATA or PARITY payload */
-		stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
+		stripe_sect = (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) ?
 			raid5_compute_sector(
 				conf, le64_to_cpu(payload->location), 0, &dd,
 				NULL)
@@ -2151,7 +2151,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 			list_add_tail(&sh->lru, cached_stripe_list);
 		}
 
-		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
 			if (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
 			    test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags)) {
 				r5l_recovery_replay_one_stripe(conf, sh, ctx);
@@ -2159,7 +2159,7 @@ r5c_recovery_analyze_meta_block(struct r5l_log *log,
 			}
 			r5l_recovery_load_data(log, sh, ctx, payload,
 					       log_offset);
-		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY)
+		} else if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY)
 			r5l_recovery_load_parity(log, sh, ctx, payload,
 						 log_offset);
 		else
@@ -2361,7 +2361,7 @@ r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
 				payload = (void *)mb + offset;
 				payload->header.type = cpu_to_le16(
 					R5LOG_PAYLOAD_DATA);
-				payload->size = BLOCK_SECTORS;
+				payload->size = cpu_to_le32(BLOCK_SECTORS);
 				payload->location = cpu_to_le64(
 					raid5_compute_blocknr(sh, i, 0));
 				addr = kmap_atomic(dev->page);
-- 
2.5.0


^ permalink raw reply related

* [PATCH] md/raid1: skip data copy for behind io for discard request
From: Shaohua Li @ 2017-03-24 22:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Ming Lei

discard request doesn't have data attached, so it's meaningless to
allocate memory and copy from original bio for behind IO. And the copy
is bogus because bio_copy_data_partial can't handle discard request.

We don't support writesame/writezeros request so far.

Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c6a671f..b7d9651 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1102,6 +1102,10 @@ static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
 	if (!behind_bio)
 		goto fail;
 
+	/* discard op, we don't support writezero/writesame yet */
+	if (!bio_has_data(bio))
+		goto skip_copy;
+
 	while (i < vcnt && size) {
 		struct page *page;
 		int len = min_t(int, PAGE_SIZE, size);
@@ -1118,7 +1122,7 @@ static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
 
 	bio_copy_data_partial(behind_bio, bio, offset,
 			      behind_bio->bi_iter.bi_size);
-
+skip_copy:
 	r1_bio->behind_master_bio = behind_bio;;
 	set_bit(R1BIO_BehindIO, &r1_bio->state);
 
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH 1/2] md: add raid4/5/6 journal mode API (for dm-raid use)
From: Shaohua Li @ 2017-03-24 17:29 UTC (permalink / raw)
  To: heinzm; +Cc: linux-raid
In-Reply-To: <20170322164423.20972-2-heinzm@redhat.com>

On Wed, Mar 22, 2017 at 05:44:22PM +0100, heinzm@redhat.com wrote:
> From: Heinz Mauelshagen <heinzm@redhat.com>
> 
> Upstream commit 2ded370373a400c20cf0c6e941e724e61582a867
> started the addition of "write-back" mode to MD (raid5-cache),
> i.e. support write-back caching on the raid journal device.
> 
> In order to allow the dm-raid target to switch between
> the available "write-through" and "write-back" modes,
> provide new r5c_journal_mode_set() API.
> 
> Use the new API in existing r5c_journal_mode_store()

Looks good to me, if this will be routed through dm tree, you can add:
Acked-by: Shaohua Li <shli@fb.com>


> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
> ---
>  drivers/md/raid5-cache.c | 62 ++++++++++++++++++++++++++----------------------
>  drivers/md/raid5.h       | 11 +++++++++
>  2 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 3f307be..218b6f3 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -53,16 +53,6 @@
>   */
>  #define R5L_POOL_SIZE	4
>  
> -/*
> - * r5c journal modes of the array: write-back or write-through.
> - * write-through mode has identical behavior as existing log only
> - * implementation.
> - */
> -enum r5c_journal_mode {
> -	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
> -	R5C_JOURNAL_MODE_WRITE_BACK = 1,
> -};
> -
>  static char *r5c_journal_mode_str[] = {"write-through",
>  				       "write-back"};
>  /*
> @@ -2327,40 +2317,56 @@ static ssize_t r5c_journal_mode_show(struct mddev *mddev, char *page)
>  	return ret;
>  }
>  
> -static ssize_t r5c_journal_mode_store(struct mddev *mddev,
> -				      const char *page, size_t length)
> +/*
> + * Set journal cache mode on @mddev (external API initially needed by dm-raid).
> + *
> + * @mode as defined in 'enum r5c_journal_mode'.
> + *
> + */
> +int r5c_journal_mode_set(struct mddev *mddev, int mode)
>  {
>  	struct r5conf *conf = mddev->private;
>  	struct r5l_log *log = conf->log;
> -	int val = -1, i;
> -	int len = length;
>  
>  	if (!log)
>  		return -ENODEV;
>  
> -	if (len && page[len - 1] == '\n')
> -		len -= 1;
> -	for (i = 0; i < ARRAY_SIZE(r5c_journal_mode_str); i++)
> -		if (strlen(r5c_journal_mode_str[i]) == len &&
> -		    strncmp(page, r5c_journal_mode_str[i], len) == 0) {
> -			val = i;
> -			break;
> -		}
> -	if (val < R5C_JOURNAL_MODE_WRITE_THROUGH ||
> -	    val > R5C_JOURNAL_MODE_WRITE_BACK)
> +	if (mode < R5C_JOURNAL_MODE_WRITE_THROUGH ||
> +	    mode > R5C_JOURNAL_MODE_WRITE_BACK)
>  		return -EINVAL;
>  
>  	if (raid5_calc_degraded(conf) > 0 &&
> -	    val == R5C_JOURNAL_MODE_WRITE_BACK)
> +	    mode == R5C_JOURNAL_MODE_WRITE_BACK)
>  		return -EINVAL;
>  
>  	mddev_suspend(mddev);
> -	conf->log->r5c_journal_mode = val;
> +	conf->log->r5c_journal_mode = mode;
>  	mddev_resume(mddev);
>  
>  	pr_debug("md/raid:%s: setting r5c cache mode to %d: %s\n",
> -		 mdname(mddev), val, r5c_journal_mode_str[val]);
> -	return length;
> +		 mdname(mddev), mode, r5c_journal_mode_str[mode]);
> +	return 0;
> +}
> +EXPORT_SYMBOL(r5c_journal_mode_set);
> +
> +static ssize_t r5c_journal_mode_store(struct mddev *mddev,
> +				      const char *page, size_t length)
> +{
> +	int mode = ARRAY_SIZE(r5c_journal_mode_str);
> +	size_t len = length;
> +
> +	if (len < 2)
> +		return -EINVAL;
> +
> +	if (page[len - 1] == '\n')
> +		len--;
> +
> +	while (mode--)
> +		if (strlen(r5c_journal_mode_str[mode]) == len &&
> +		    !strncmp(page, r5c_journal_mode_str[mode], len))
> +			break;
> +
> +	return r5c_journal_mode_set(mddev, mode) ?: length;
>  }
>  
>  struct md_sysfs_entry
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 4bb27b9..ec8ca15 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -547,6 +547,16 @@ struct r5worker_group {
>  	int stripes_cnt;
>  };
>  
> +/*
> + * r5c journal modes of the array: write-back or write-through.
> + * write-through mode has identical behavior as existing log only
> + * implementation.
> + */
> +enum r5c_journal_mode {
> +	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
> +	R5C_JOURNAL_MODE_WRITE_BACK = 1,
> +};
> +
>  enum r5_cache_state {
>  	R5_INACTIVE_BLOCKED,	/* release of inactive stripes blocked,
>  				 * waiting for 25% to be free
> @@ -795,4 +805,5 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
>  extern struct md_sysfs_entry r5c_journal_mode;
>  extern void r5c_update_on_rdev_error(struct mddev *mddev);
>  extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
> +extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);
>  #endif
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 1/2] md/raid5: Add change_consistency_policy to raid4 and raid6
From: Shaohua Li @ 2017-03-24 17:08 UTC (permalink / raw)
  To: Artur Paszkiewicz
  Cc: Song Liu, linux-raid, shli, neilb, kernel-team, dan.j.williams,
	hch, jes.sorensen
In-Reply-To: <8acaaf1e-7e67-2c9b-fc1f-7c9237131260@intel.com>

On Fri, Mar 24, 2017 at 04:17:16PM +0100, Artur Paszkiewicz wrote:
> On 03/24/2017 03:45 PM, Shaohua Li wrote:
> > On Thu, Mar 16, 2017 at 03:14:08PM -0700, Song Liu wrote:
> >> Add change_consistency_policy to raid6_personality and
> >> raid4_personality.
> >>
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >>  drivers/md/raid5.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index 88cc898..7a6e7ea 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -8408,6 +8408,7 @@ static struct md_personality raid6_personality =
> >>  	.quiesce	= raid5_quiesce,
> >>  	.takeover	= raid6_takeover,
> >>  	.congested	= raid5_congested,
> >> +	.change_consistency_policy = raid5_change_consistency_policy,
> >>  };
> >>  static struct md_personality raid5_personality =
> >>  {
> >> @@ -8456,6 +8457,7 @@ static struct md_personality raid4_personality =
> >>  	.quiesce	= raid5_quiesce,
> >>  	.takeover	= raid4_takeover,
> >>  	.congested	= raid5_congested,
> >> +	.change_consistency_policy = raid5_change_consistency_policy,
> >>  };
> > sorry for the late reply.
> > 
> > the change_consistency_policy is only for raid5 for a reason, ppl is only for
> > raid5. I think you need to filter out raid 4/6 in
> > raid5_change_consistency_policy for ppl.
> 
> Not necessarily. There is a check in ppl_init_log() that won't allow
> starting ppl if it's not raid5. But it certainly wouldn't hurt.

It will still set the PPL flag for raid4/6. Checking raid 5 in
change_consistency_policy is clearer anyway.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 02/14] md: move two macros into md.h
From: Shaohua Li @ 2017-03-24 16:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ming Lei, Jens Axboe, linux-raid, linux-block, Christoph Hellwig
In-Reply-To: <87tw6j8be6.fsf@notabene.neil.brown.name>

On Fri, Mar 24, 2017 at 04:57:37PM +1100, Neil Brown wrote:
> On Fri, Mar 17 2017, Ming Lei wrote:
> 
> > Both raid1 and raid10 share common resync
> > block size and page count, so move them into md.h.
> 
> I don't think this is necessary.
> These are just "magic" numbers.  They don't have any real
> meaning and so don't belong in md.h, or and .h file.
> 
> Possibly we should find more meaningful numbers, or make them auto-size
> or something.  I'm also happy for them to stay as they are for now.
> But I don't think we should pretend that they are meaningful.

I had the same concern when I looked at this patch firstly. The number for
raid1/10 doesn't need to be the same. But if we don't move the number to a
generic header, the third patch will become a little more complicated. I
eventually ignored this issue. If we really need different number for raid1/10,
lets do it at that time.

I think your suggestion that moving the number to raid1-10.h makes sense, and
add a comment declaring the number isn't required to be the same for raid1/10.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-03-24 16:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Artur Paszkiewicz, shli, linux-raid
In-Reply-To: <87wpbib88g.fsf@notabene.neil.brown.name>

On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> 
> > Implement the calculation of partial parity for a stripe and PPL write
> > logging functionality. The description of PPL is added to the
> > documentation. More details can be found in the comments in raid5-ppl.c.
> >
> > Attach a page for holding the partial parity data to stripe_head.
> > Allocate it only if mddev has the MD_HAS_PPL flag set.
> >
> > Partial parity is the xor of not modified data chunks of a stripe and is
> > calculated as follows:
> >
> > - reconstruct-write case:
> >   xor data from all not updated disks in a stripe
> >
> > - read-modify-write case:
> >   xor old data and parity from all updated disks in a stripe
> >
> > Implement it using the async_tx API and integrate into raid_run_ops().
> > It must be called when we still have access to old data, so do it when
> > STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> > stored into sh->ppl_page.
> >
> > Partial parity is not meaningful for full stripe write and is not stored
> > in the log or used for recovery, so don't attempt to calculate it when
> > stripe has STRIPE_FULL_WRITE.
> >
> > Put the PPL metadata structures to md_p.h because userspace tools
> > (mdadm) will also need to read/write PPL.
> >
> > Warn about using PPL with enabled disk volatile write-back cache for
> > now. It can be removed once disk cache flushing before writing PPL is
> > implemented.
> >
> > Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> 
> Sorry for the delay in getting to this for review...
> 
> > +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> > +					  struct stripe_head *sh)
> > +{
> > +	struct ppl_conf *ppl_conf = log->ppl_conf;
> > +	struct ppl_io_unit *io;
> > +	struct ppl_header *pplhdr;
> > +
> > +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> > +	if (!io)
> > +		return NULL;
> > +
> > +	memset(io, 0, sizeof(*io));
> > +	io->log = log;
> > +	INIT_LIST_HEAD(&io->log_sibling);
> > +	INIT_LIST_HEAD(&io->stripe_list);
> > +	atomic_set(&io->pending_stripes, 0);
> > +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> > +
> > +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> 
> I'm trying to understand how these two mempool_alloc()s relate, and
> particularly why the first one needs to be GFP_ATOMIC, while the second
> one can safely be GFP_NOIO.
> I see that the allocated memory is freed in different places:
> header_page is called from the bi_endio function as soon as the write
> completes, while 'io' is freed later.  But I'm not sure that is enough
> to make it safe.
> 
> When working with mempools, you need to assume that the pool only
> contains one element, and that every time you call mempool_alloc(), it
> waits for that one element to be available.  While that doesn't usually
> happen, it is possible and if that case isn't handled correctly, the
> system can deadlock.
> 
> If no memory is available when this mempool_alloc() is called, it will
> block.  As it is called from the raid5d thread, the whole array will
> block.  So this can only complete safely is the write request has
> already been submitted - or if there is some other workqueue which
> submit requests after a timeout or similar.
> I don't see that in the code.  These ppl_io_unit structures can queue up
> and are only submitted later by raid5d (I think).  So if raid5d waits
> for one to become free, it will wait forever.
> 
> One easy way around this problem (assuming my understanding is correct)
> is to just have a single mempool which allocates both a struct
> ppl_io_unit and a page.  You would need to define you own alloc/free
> routines for the pool but that is easy enough.
> 
> Then you only need a single mempool_alloc(), which can sensibly be
> GFP_ATOMIC.
> If that fails, you queue for later handling as you do now.  If it
> succeeds, then you continue to use the memory without any risk of
> deadlocking.

Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with
commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool
does make sense.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v3 08/14] block: introduce bio_copy_data_partial
From: Jens Axboe @ 2017-03-24 16:41 UTC (permalink / raw)
  To: Ming Lei, Shaohua Li, linux-raid, linux-block, Christoph Hellwig
In-Reply-To: <20170316161235.27110-9-tom.leiming@gmail.com>

On 03/16/2017 10:12 AM, Ming Lei wrote:
> Turns out we can use bio_copy_data in raid1's write behind,
> and we can make alloc_behind_pages() more clean/efficient,
> but we need to partial version of bio_copy_data().
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>

Reviewed-by: Jens Axboe <axboe@fb.com>

Shaohua, feel free to pull this through the md tree, that will be much
easier.

-- 
Jens Axboe


^ permalink raw reply

* Re: md/raid5: report one failure scenario about growing array from mirror to level 5
From: Shaohua Li @ 2017-03-24 15:51 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: NeilBrown, linux-raid@vger.kernel.org
In-Reply-To: <a0a4cb0b-94d3-a54d-364f-54f02af6e09a@suse.com>

On Thu, Mar 16, 2017 at 10:14:04PM +0800, Zhilong Liu wrote:
> hi, list;
> 
>   I just trace the code of the following scenarios, grow array from mirror
>   to level 5,  it would be stuck at the beginning of reshape_progress when
>   using the loop devices, the same testing steps work well via to hard disk
>   devices.  refer to the detail steps as follow.
>   Is this issue only happened on my site?
> 
> *Steps for loop devices:*
> 1). create one mirror array via to mdadm source code.
> #  cd mdadm/
> #  ./test setup
> #  ./mdadm -CR /dev/md0 --level 1 -b internal -n2 /dev/loop[0-1]
> #  dmesg -c,    clean the dmesg.
> 2). triggers the reshape_request, then it would be stuck when start
> reshape_progress.
> ... ...
> linux-x4lv:~/mdadm-test # ./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2
> mdadm: level of /dev/md0 changed to raid5
> mdadm: added /dev/loop2
> mdadm: Need to backup 128K of critical section..
> 
> linux-x4lv:~/mdadm-test # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid5 loop2[2] loop1[1] loop0[0]
>       19968 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>       [>....................]  reshape =  0.0% (1/19968) finish=41.5min
> speed=0K/sec
>       bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> 
> linux-x4lv:~/mdadm-test # dmesg -c
> [111544.283359] md/raid:md0: device loop1 operational as raid disk 1
> [111544.283362] md/raid:md0: device loop0 operational as raid disk 0
> [111544.296161] md/raid:md0: raid level 5 active with 2 out of 2 devices,
> algorithm 2
> [111544.296178] md/raid456: discard support disabled due to uncertainty.
> [111544.296178] Set raid456.devices_handle_discard_safely=Y to override.
> [111544.932238] md: reshape of RAID array md0
> 
> *Steps for hard disks:*
> # ./mdadm --grow /dev/md0 -l5 -n3  -a /dev/sdd
> mdadm: level of /dev/md0 changed to raid5
> mdadm: added /dev/sdd
> mdadm: Need to backup 128K of critical section..
> 
> # cat /proc/mdstat
> Personalities : [raid1] [raid0] [raid6] [raid5] [raid4]
> md0 : active raid5 sdd[2] sdc[1] sdb[0]
>       102272 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>       [====>................]  reshape = 22.5% (23168/102272) finish=0.5min
> speed=2316K/sec
>       bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>

can't reproduce locally. does this only exist in specific mdadm/kernel version?

Thanks,
Shaohua

^ permalink raw reply

* Re: mdadm: one question about the readonly and readwrite feature
From: Zhilong @ 2017-03-24 15:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: Guoqing Jiang, Jes Sorensen, linux-raid@vger.kernel.org
In-Reply-To: <87d1d7a57j.fsf@notabene.neil.brown.name>





发自我的 iPhone
> 在 2017年3月24日，08:28，NeilBrown <neilb@suse.com> 写道：
> 
>> On Thu, Mar 23 2017, Zhilong Liu wrote:
>> 
>>> On 03/23/2017 03:06 PM, NeilBrown wrote:
>>> Why?
>>> How do the actions performed by set_disk_ro() interact with the actions
>>> performed by md_wakeup_thread()?
>>> 
>>> I'm not saying the code shouldn't be changed, just that there needs to
>>> be a clear explanation of the benefit.
>> 
>> here just according to my understanding for readonly code path, welcome
>> the correction in my comments if I misunderstand this feature, :-).
> 
> I'm sorry if this sounds harsh, but I get the impression that you aren't
> really trying to understand the code.  You are just guessing about what
> things might be doing, rather than doing careful research to determine
> exactly what the code does.
> 
> Do you know what "set_disk_ro()" does?  What are the consequences of
> call it?  Until you know that, you cannot make any reasonable assessment
> on where the call should go.
> 
> Do you know why we set MD_RECOVERY_NEEDED and wake up the thread?
> You seem to expect that it might cause some write out, however it
> happens just after a call to __md_stop_writes() which aims to stop all
> writes happening to the array.  So that seems unlikely (but is worth
> checking).
> 
> 
>> ... ...
>> static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
>> {
>>         int err = 0;
>>         int did_freeze = 0;
>> 
>>         if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
>>                 did_freeze = 1;
>>                 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);  --> 
>> here has set bit as FROZEN
>>                 md_wakeup_thread(mddev->thread);
>>         }
>> ... ...
>> 
>>         if (mddev->pers) {
>>                 __md_stop_writes(mddev);
>> /*
>>  *  set FROZEN again, maybe can use test_and_set_bit(FROZEN) better, it 
>> doesn't matter.
> 
> Correct, it doesn't matter.
> 
> 
>>  *  it flushed and synced all data, bitmap and superblock to array.
> 
> Correct again.
> 
> 
>>  */
>>                 err  = -ENXIO;
>>                 if (mddev->ro==1)
>>                         goto out;
>>                 mddev->ro = 1;
>> /*
>>  *  Here, I means that set_disk_ro until the daemon thread has 
>> completed all operations
>>  *  include of sync and recovery progress. set_disk_ro when the array 
>> has cleaned totally,
>>  *  then it would be safer.
> 
> Completed which operations exactly?  __md_stop_writes() has called
> md_reap_sync_thread() so there cannot still be any recovery.  It has
> called pers->quiesce() so there cannot be any outstanding io.
> And just moving the call to afterwards would't cause it to wait for those
> supposed operations to complete.
> 
>>  *  I'm not sure MD_RECOVERY_NEEDED would be running once the array has 
>> set_disk_ro,
>>  *  actually I don't know how to test this scenario, thus asked this 
>> question.
> 
> The first step is to understand the code.
> Your questions was "should we move this line to here", without asking any
> questions about what the code does, or showing much understanding of it.
> 
> I do encourage you to ask questions, but it is best to start with
> questions that make sense.
> And once you have framed a question, try to answer it yourself.
> Read the code (e.g. the code for set_disk_ro()) if you haven't read it
> before, to be sure you understand what it does.
> And if you don't know why some code does something, it often helps to
> look through the git logs, or use "git blame" to find out when the code
> was added or changed.  Often the changelog of the patch will explain the
> purpose of the code.
> 

Sorry for the later response. I have a sick leave and go to hospital today.
many thanks, every suggestion is very inspiring to me. As a new student of this interesting project, I should keep more strict to myself, and I would continue to learn this project carefully and try to bring more meaningful question here.

For the above patch to clear the MD_CLOSING bit, I have tested and it works well. :-). 

Best regards,
-Zhilong

> NeilBrown
> 


^ permalink raw reply

* Re: [PATCH v2 1/2] md/raid5: Add change_consistency_policy to raid4 and raid6
From: Artur Paszkiewicz @ 2017-03-24 15:17 UTC (permalink / raw)
  To: Shaohua Li, Song Liu
  Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
	jes.sorensen
In-Reply-To: <20170324144535.chk4jgktism4mvlu@kernel.org>

On 03/24/2017 03:45 PM, Shaohua Li wrote:
> On Thu, Mar 16, 2017 at 03:14:08PM -0700, Song Liu wrote:
>> Add change_consistency_policy to raid6_personality and
>> raid4_personality.
>>
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>>  drivers/md/raid5.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 88cc898..7a6e7ea 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -8408,6 +8408,7 @@ static struct md_personality raid6_personality =
>>  	.quiesce	= raid5_quiesce,
>>  	.takeover	= raid6_takeover,
>>  	.congested	= raid5_congested,
>> +	.change_consistency_policy = raid5_change_consistency_policy,
>>  };
>>  static struct md_personality raid5_personality =
>>  {
>> @@ -8456,6 +8457,7 @@ static struct md_personality raid4_personality =
>>  	.quiesce	= raid5_quiesce,
>>  	.takeover	= raid4_takeover,
>>  	.congested	= raid5_congested,
>> +	.change_consistency_policy = raid5_change_consistency_policy,
>>  };
> sorry for the late reply.
> 
> the change_consistency_policy is only for raid5 for a reason, ppl is only for
> raid5. I think you need to filter out raid 4/6 in
> raid5_change_consistency_policy for ppl.

Not necessarily. There is a check in ppl_init_log() that won't allow
starting ppl if it's not raid5. But it certainly wouldn't hurt.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v2 2/2] md/raid5: use consistency_policy to remove journal feature
From: Shaohua Li @ 2017-03-24 14:48 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
	jes.sorensen
In-Reply-To: <20170316221409.3283344-2-songliubraving@fb.com>

On Thu, Mar 16, 2017 at 03:14:09PM -0700, Song Liu wrote:
> When journal device of an array fails, the array is forced into read-only
> mode. To make the array normal without adding another journal device, we
> need to remove journal _feature_ from the array.
> 
> This patch allows remove journal _feature_ from an array, For journal
> existing journal should be either missing or faulty.
> 
> To remove journal feature, one can simply echo into the sysfs file:
> 
>  cat /sys/block/md0/md/consistency_policy
>  journal
> 
>  echo resync > /sys/block/md0/md/consistency_policy
>  cat /sys/block/md0/md/consistency_policy
>  resync

please describe the detail steps what you want to do after a journal disk
failure here, like remove the disk. maybe create a new test case too.

> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7a6e7ea..2302ec4 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8369,11 +8369,19 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
>  		if (!err)
>  			raid5_reset_stripe_cache(mddev);
>  		mddev_resume(mddev);
> -	} else if (strncmp(buf, "resync", 6) == 0 && raid5_has_ppl(conf)) {
> -		mddev_suspend(mddev);
> -		log_exit(conf);
> -		raid5_reset_stripe_cache(mddev);
> -		mddev_resume(mddev);
> +	} else if (strncmp(buf, "resync", 6) == 0) {
> +		if (raid5_has_ppl(conf)) {
> +			mddev_suspend(mddev);
> +			log_exit(conf);
> +			raid5_reset_stripe_cache(mddev);
> +			mddev_resume(mddev);
> +		} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
> +			   r5l_log_disk_error(conf)) {
> +			mddev_suspend(mddev);
> +			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> +			mddev_resume(mddev);

should this happen after journal disk is removed? I'm wondering if this need
extra checks.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 1/2] md/raid5: Add change_consistency_policy to raid4 and raid6
From: Shaohua Li @ 2017-03-24 14:45 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
	jes.sorensen
In-Reply-To: <20170316221409.3283344-1-songliubraving@fb.com>

On Thu, Mar 16, 2017 at 03:14:08PM -0700, Song Liu wrote:
> Add change_consistency_policy to raid6_personality and
> raid4_personality.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 88cc898..7a6e7ea 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8408,6 +8408,7 @@ static struct md_personality raid6_personality =
>  	.quiesce	= raid5_quiesce,
>  	.takeover	= raid6_takeover,
>  	.congested	= raid5_congested,
> +	.change_consistency_policy = raid5_change_consistency_policy,
>  };
>  static struct md_personality raid5_personality =
>  {
> @@ -8456,6 +8457,7 @@ static struct md_personality raid4_personality =
>  	.quiesce	= raid5_quiesce,
>  	.takeover	= raid4_takeover,
>  	.congested	= raid5_congested,
> +	.change_consistency_policy = raid5_change_consistency_policy,
>  };
sorry for the late reply.

the change_consistency_policy is only for raid5 for a reason, ppl is only for
raid5. I think you need to filter out raid 4/6 in
raid5_change_consistency_policy for ppl.

Thanks,
Shaohua
>  
>  static int __init raid5_init(void)
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
From: Dmitry Vyukov @ 2017-03-24 14:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Alexander Potapenko, Michael Davidson,
	Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li, Matthias Kaehlcke,
	x86@kernel.org, open list:KERNEL BUILD + fi..., LKML,
	linux-crypto, linux-raid, kbuild-all, Fengguang Wu
In-Reply-To: <20170324141053.lte3qq7mfl6krlkb@hirez.programming.kicks-ass.net>

On Fri, Mar 24, 2017 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 24, 2017 at 02:50:24PM +0100, Dmitry Vyukov wrote:
>> OK, I guess should not have referenced the llvm-linux page.
>> So here are reasons on our side that I am ready to vouch:
>>
>>  - clang make it possible to implement KMSAN (dynamic detection of
>> uses of uninit memory)
>
> How does GCC make this impossible?

Too complex and too difficult to implement correctly on all corner
cases. All other sanitizers were ported to gcc very quickly, but msan
wasn't. Nobody is brave enough to even approach it.

>>  - better code coverage for fuzzing
>
> How so? Why can't the same be achieved using GCC?

Same reason.

>>  - why simpler and faster development (e.g. we can port our user-space
>> hardening technologies -- CFI and SafeStack)
>
> That's just because you've already implemented this in clang, right? So
> less work for you. Not because its impossible.

I am not saying that it's impossible. It would just require
unreasonable amount of time, and then perpetual maintenance to fix
corner cases and regressions.

For background: I implemented the current fuzzing coverage (KCOV) in
gcc, and user-space tsan instrumentation in gcc.

^ permalink raw reply

* Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
From: Peter Zijlstra @ 2017-03-24 14:10 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Borislav Petkov, Alexander Potapenko, Michael Davidson,
	Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li, Matthias Kaehlcke,
	x86@kernel.org, open list:KERNEL BUILD + fi..., LKML,
	linux-crypto, linux-raid, kbuild-all, Fengguang Wu
In-Reply-To: <CACT4Y+bRWePGuZB-HCGBhHuENHAM=uDDv3i81fwZjNUKhNY2UA@mail.gmail.com>

On Fri, Mar 24, 2017 at 02:50:24PM +0100, Dmitry Vyukov wrote:
> OK, I guess should not have referenced the llvm-linux page.
> So here are reasons on our side that I am ready to vouch:
> 
>  - clang make it possible to implement KMSAN (dynamic detection of
> uses of uninit memory)

How does GCC make this impossible?

>  - better code coverage for fuzzing

How so? Why can't the same be achieved using GCC?

>  - why simpler and faster development (e.g. we can port our user-space
> hardening technologies -- CFI and SafeStack)

That's just because you've already implemented this in clang, right? So
less work for you. Not because its impossible.

^ permalink raw reply

* Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
From: Peter Zijlstra @ 2017-03-24 14:09 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: H. Peter Anvin, Michael Davidson, Alexander Potapenko,
	Michal Marek, Thomas Gleixner, Ingo Molnar, Herbert Xu,
	David S. Miller, Shaohua Li, Matthias Kaehlcke, x86@kernel.org,
	open list:KERNEL BUILD + fi..., LKML, linux-crypto, linux-raid
In-Reply-To: <CACT4Y+awYCAwoCkh8PQ-i2BdkxeuFp9meVXp=Yv6hGY8YjDnQg@mail.gmail.com>

On Fri, Mar 24, 2017 at 02:47:15PM +0100, Dmitry Vyukov wrote:
> > Seriously, you should have taken the hack the first time that this
> > needs to be fixed.  Just because this is a fairly uncommon construct
> > in the kernel doesn't mean it is not in userspace.
> 
> There is a reason why it is fairly uncommon in kernel.

So first off; its not entirely clear that the code as it exists it
correct. From a cursory reading of it and surrounding code, there is no
actual upper limit on the variable. If I were stupid enough to make a
raid with 64 devices I'd get a huge on-stack structure.

Since you're touching it; you should check these things.

And secondly, refactor the code to not look like dog vomit. You can do
more than the absolute minimal patch to make it compile, I'm sure.

^ permalink raw reply

* Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
From: Dmitry Vyukov @ 2017-03-24 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Alexander Potapenko, Michael Davidson,
	Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li, Matthias Kaehlcke,
	x86@kernel.org, open list:KERNEL BUILD + fi..., LKML,
	linux-crypto, linux-raid, kbuild-all, Fengguang Wu
In-Reply-To: <20170317192935.d5almj4brat6uvlt@hirez.programming.kicks-ass.net>

On Fri, Mar 17, 2017 at 8:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 17, 2017 at 08:26:42PM +0100, Peter Zijlstra wrote:
>> On Fri, Mar 17, 2017 at 08:05:16PM +0100, Dmitry Vyukov wrote:
>> > You can also find some reasons in the Why section of LLVM-Linux project:
>> > http://llvm.linuxfoundation.org/index.php/Main_Page
>>
>> From that:
>>
>>  - LLVM/Clang is a fast moving project with many things fixed quickly
>>    and features added.
>>
>> So what's the deal with that 5 year old bug you want us to work around?
>>
>> Also, clang doesn't support asm cc flags output and a few other
>> extensions last time I checked.
>>
>
> Another great one:
>
>  - BSD License (some people prefer this license to the GPL)
>
> Seems a very weak argument to make when talking about the Linux Kernel
> which is very explicitly GPLv2 (and not later).

OK, I guess should not have referenced the llvm-linux page.
So here are reasons on our side that I am ready to vouch:

 - clang make it possible to implement KMSAN (dynamic detection of
uses of uninit memory)
 - better code coverage for fuzzing
 - why simpler and faster development (e.g. we can port our user-space
hardening technologies -- CFI and SafeStack)

Michael is on a different team and has own reasons to do this.

^ permalink raw reply

* Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
From: Dmitry Vyukov @ 2017-03-24 13:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Michael Davidson, Alexander Potapenko,
	Michal Marek, Thomas Gleixner, Ingo Molnar, Herbert Xu,
	David S. Miller, Shaohua Li, Matthias Kaehlcke, x86@kernel.org,
	open list:KERNEL BUILD + fi..., LKML, linux-crypto, linux-raid
In-Reply-To: <96158847-4F1F-4293-8435-E111F39E3260@zytor.com>

On Fri, Mar 17, 2017 at 9:04 PM,  <hpa@zytor.com> wrote:
> On March 17, 2017 12:27:46 PM PDT, Peter Zijlstra <peterz@infradead.org> wrote:
>>On Fri, Mar 17, 2017 at 11:52:01AM -0700, Michael Davidson wrote:
>>> On Fri, Mar 17, 2017 at 5:44 AM, Peter Zijlstra
>><peterz@infradead.org> wrote:
>>> >
>>> > Be that as it may; what you construct above is disgusting. Surely
>>the
>>> > code can be refactored to not look like dog vomit?
>>> >
>>> > Also; its not immediately obvious conf->copies is 'small' and this
>>> > doesn't blow up the stack; I feel that deserves a comment
>>somewhere.
>>> >
>>>
>>> I agree that the code is horrible.
>>>
>>> It is, in fact, exactly the same solution that was used to remove
>>> variable length arrays in structs from several of the crypto drivers
>>a
>>> few years ago - see the definition of SHASH_DESC_ON_STACK() in
>>> "crypto/hash.h" - I did not, however, hide the horrors in a macro
>>> preferring to leave the implementation visible as a warning to
>>whoever
>>> might touch the code next.
>>>
>>> I believe that the actual stack usage is exactly the same as it was
>>previously.
>>>
>>> I can certainly wrap this  up in a macro and add comments with
>>> appropriately dire warnings in it if you feel that is both necessary
>>> and sufficient.
>>
>>We got away with ugly in the past, so we should get to do it again?
>
> Seriously, you should have taken the hack the first time that this needs to be fixed.  Just because this is a fairly uncommon construct in the kernel doesn't mean it is not in userspace.


There is a reason why it is fairly uncommon in kernel.
Initially it was used more widely, but then there was a decision to
drop all uses of this feature. Namely:
Linus: "We should definitely drop it. The feature is an abomination".
https://lkml.org/lkml/2013/9/23/500
I really don't understand why you cling onto this last use of the
feature. Having a single use of a compiler extension on an error path
of a non-mandatory driver does not look like a great idea to me. Let's
just kill it off outside of clang discussion.

^ permalink raw reply

* RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-24  7:44 UTC (permalink / raw)
  To: Linux-RAID

I'm in the process of setting up a new little array. 8 x 6TB drives in a 
RAID6. While I have the luxury of a long burn in period I've been 
beating it up and have seen some odd performance anomalies.

I have one in front of me now, so I thought I'd lay out the data and see 
if anyone has any ideas as to what might be going on.

Here's the current state. I did this by removing and adding /dev/sdb 
without a write intent bitmap to deliberately cause a rebuild.

/dev/md0:
         Version : 1.2
   Creation Time : Wed Mar 22 14:01:41 2017
      Raid Level : raid6
      Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
   Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
    Raid Devices : 8
   Total Devices : 8
     Persistence : Superblock is persistent

     Update Time : Fri Mar 24 15:34:28 2017
           State : clean, degraded, recovering
  Active Devices : 7
Working Devices : 8
  Failed Devices : 0
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

  Rebuild Status : 0% complete

            Name : test:0  (local to host test)
            UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
          Events : 394

     Number   Major   Minor   RaidDevice State
        8       8       16        0      spare rebuilding   /dev/sdb
        1       8       32        1      active sync   /dev/sdc
        2       8       48        2      active sync   /dev/sdd
        3       8       64        3      active sync   /dev/sde
        4       8       80        4      active sync   /dev/sdf
        5       8       96        5      active sync   /dev/sdg
        6       8      128        6      active sync   /dev/sdi
        7       8      144        7      active sync   /dev/sdj

Here's the iostat output (hope it doesn't wrap).
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.05    0.00   10.42    7.85    0.00   81.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.60    0.00    1.40     0.00     8.80 
12.57     0.02   12.86    0.00   12.86  12.86   1.80
sdb               0.00 18835.60    0.00  657.80     0.00 90082.40 
273.89     3.72    4.71    0.00    4.71   0.85  55.80
sdc           20685.80     0.00  244.20    0.00 87659.20     0.00 
717.93     8.65   34.10   34.10    0.00   2.15  52.40
sdd           20664.60     0.00  244.60    0.00 87652.00     0.00 
716.70     8.72   34.28   34.28    0.00   2.19  53.60
sde           20647.80     0.00  240.40    0.00 87556.80     0.00 
728.43     9.13   36.54   36.54    0.00   2.30  55.40
sdf           20622.40     0.00  242.40    0.00 87556.80     0.00 
722.42     8.73   34.60   34.60    0.00   2.20  53.40
sdg           20596.00     0.00  239.20    0.00 87556.80     0.00 
732.08     9.32   37.54   37.54    0.00   2.37  56.60
sdh               0.00     1.60    0.00    1.40     0.00     8.80 
12.57     0.01    7.14    0.00    7.14   7.14   1.00
sdi           20575.80     0.00  238.20    0.00 86999.20     0.00 
730.47     8.53   34.06   34.06    0.00   2.20  52.40
sdj           22860.80     0.00  475.80    0.00 101773.60     0.00 
427.80   245.09  513.25  513.25    0.00   2.10 100.00
md1               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.00     0.00     8.00 
8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00

The long and short is /dev/sdj is the last drive in the array, and is 
getting hit with a completely different read pattern to the other 
drives, causing it to bottleneck the rebuild process.

I *thought* the rebuild process was "read one stripe, calculate the 
missing bit and write it out to the drive being rebuilt".

I've seen this behaviour now a number of times, but this is the first 
time I've been able to reliably reproduce it. Of course it takes about 
20 hours to complete the rebuild, so it's a slow diagnostic process.

I've set the stripe cache size to 8192. Didn't make a dent.

The bottleneck drive seems to change depending on the load. I've seen it 
happen simply dd'ing the array to /dev/null where the transfer rate 
slows to < 150MB/s. Stop and restart the transfer and it's back up to 
500MB/s.

I've reproduced this on kernel 4.6.4 & 4.10.5, so I'm not sure what is 
going on at the moment. There is obviously a sub-optimal read pattern 
getting fed to sdj. I had a look at it with blocktrace, but went cross 
eyed trying to figure out what was going on.

The drives are all on individual lanes on a SAS controller, are set with 
the deadline scheduler and I can get about 160MB/s sustained from all 
drives simultaneously using dd.

It's not important, but I thought since I was seeing it and I have a 
month or so of extra time with this array before it needs to do useful 
work, I'd ask.

Regards,
Brad

^ permalink raw reply

* Re: constant array_state active after specific jobs
From: pdi @ 2017-03-24  7:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li
In-Reply-To: <8737e39rg0.fsf@notabene.neil.brown.name>

On Fri, 24 Mar 2017 16:25:35 +1100
NeilBrown <neilb@suse.com> wrote:

> On Thu, Mar 23 2017, pdi wrote:
> 
> > Greetings all,
> >
> > The problem in a nutshell is that an array is clean after boot,
> > until some specific jobs switch it to active where it remains until
> > reboot.
> >
> > A similar problem was discussed, and solved, in 
> > https://www.spinics.net/lists/raid/msg46450.html. However, AFAICT,
> > it is not the same issue.
> >
> > I would be grateful for any insights as to why this happens and/or
> > how to prevent it.
> >
> > The relevant info follows, please let me know if anything further
> > might help.
> >
> > Many thanks in advance.
> >
> > - uname -a
> >   Linux hostname 4.4.38 #1 SMP Sun Dec 11 16:03:41 CST 2016 x86_64
> >   Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz GenuineIntel GNU/Linux
> > - mdadm -V
> >   mdadm - v3.3.4 - 3rd August 2015
> > - Desktop drives without sct/erc,
> >   with timeout mismatch correction as per
> >   https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> > - /dev/md9 is a raid10 array, 4 devices, far=2,
> >   with various dirs used as samba and nfs shares
> > - The array is in *constant* array_state active
> > - mdadm -D /dev/md9 | grep 'State :'
> >   State : active
> > - cat /sys/block/md9/md/array_state
> >   active
> > - watch -d 'grep md9 /proc/diskstats'
> >   remain unchanged
> > - uptime
> >   load average: 0.00, 0.00, 0.00
> > - cat /sys/block/md9/md/safe_mode_delay
> >   0.201
> > - echo 0.1 > /sys/block/md9/md/safe_mode_delay
> >   array_state remains active
> > - echo clean > /sys/block/md9/md/array_state
> >   echo: write error: Device or resource busy
> > - reboot (with or without prior check)
> >   array_state clean
> > - After reboot, array remains clean until some specific
> >   jobs put it in constant active state. Such jobs so far
> >   identified:
> >   - echo check > /sys/block/md9/md/sync_action
> >   - run an rsnapshot job
> >   - start a qemu/kvm vm
> > - Other jobs, like text/doc editing, multimedia playback,
> >   etc retain array_state clean  
> 
> This bug was introduced by
> Commit: 20d0189b1012 ("block: Introduce new bio_split()")
> in 3.14, and fixed by
> Commit: 9b622e2bbcf0 ("raid10: increment write counter after bio is
> split") in 4.8.
> 
> Maybe the latter patch should be sent to -stable ??
> 
> NeilBrown

NeilBrown, thank you for your swift and concise answer.

I gather you are referring to kernel version numbers. The described
behaviour was first noticed many months ago with kernel 2.6.37.6, and
persisted after a system upgrade and kernel 4.4.38. However, after the
upgrade two things were corrected, the timeout mismatch, and a
Current_Pending_Sector in one of the drives; which may, or may not,
explain the occurrence with the older kernel.

Is this constant active state in the data array something to worry about
and try kernel >= 4.8, or shall I let be?

pdi




^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox