Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-30  1:22 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID; +Cc: Dan Williams
In-Reply-To: <87wpb7612w.fsf@notabene.neil.brown.name>

> NeilBrown
>
> This works for simple tests but might not be correct.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c523fd69a7bc..2eb45d57226c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>   		BUG_ON(test_bit(R5_Wantread, &dev->flags));
>   		BUG_ON(sh->batch_head);
>   		if ((s->uptodate == disks - 1) &&
> +		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
>   		    (s->failed && (disk_idx == s->failed_num[0] ||
> -				   disk_idx == s->failed_num[1]))) {
> +				   disk_idx == s->failed_num[1])))) {
>   			/* have disk failed, and we're requested to fetch it;
>   			 * do compute it
>   			 */

G'day Neil,

Thanks for the in-depth analysis. I managed to follow most of it and 
when I get some time I'll get my head into the code and see if I can 
*really* follow along.

As I'm not particularly fussed about the integrity of the system in its 
current state, I patched the kernel, re-booted and kicked off the resync 
again.


Before patch :

brad@test:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 
[8/7] [UUUUUU_U]
       [==============>......]  recovery = 72.0% (4222847296/5860391360) 
finish=541.6min speed=50382K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.32    0.00    7.02    1.37    0.00   90.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    19.00   10.20   22.40   114.40   139.00 
15.55     0.52   15.95   26.27   11.25   5.03  16.40
sdb           16514.60     0.00  141.00    0.00 67860.00     0.00 
962.55     2.64   18.91   18.91    0.00   3.21  45.20
sdc           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.06   14.77   14.77    0.00   2.79  39.20
sdd           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.64   18.92   18.92    0.00   3.04  42.80
sde           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.49   17.85   17.85    0.00   3.14  44.20
sdf           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.24   16.05   16.05    0.00   2.86  40.20
sdg           18459.80     0.00  269.40    0.00 74772.00     0.00 
555.10    84.84  327.45  327.45    0.00   3.71 100.00
sdh               0.00    19.00    1.80   22.40    24.80   139.00 
13.54     0.25   10.17   10.00   10.18   5.87  14.20
sdi               0.00 16390.20    0.00  136.60     0.00 66138.40 
968.35     1.19    8.74    0.00    8.74   3.38  46.20
sdj           16514.40     0.00  140.80    0.00 67654.40     0.00 
961.00     2.25   16.11   16.11    0.00   2.86  40.20
md1               0.00     0.00    0.00    4.80     0.00 4.60     
1.92     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00   12.00   32.60   139.20   131.20 
12.13     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00

After Patch :

root@test:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 
[8/7] [UUUUUU_U]
       [==============>......]  recovery = 73.3% (4297641980/5860391360) 
finish=284.7min speed=91465K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    6.96    0.00    0.00   92.97

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.20     0.00 0.20     
2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb           28232.80     0.00  237.40    0.00 114502.40     0.00 
964.64     8.53   36.05   36.05    0.00   4.15  98.60
sdc           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.21    5.13    5.13    0.00   1.77  41.80
sdd           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.28    5.42    5.42    0.00   1.85  43.60
sde           28232.80     0.00  236.80    0.00 114195.20     0.00 
964.49     2.74   11.65   11.65    0.00   2.77  65.60
sdf           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.56    6.63    6.63    0.00   2.14  50.60
sdg           28232.80     0.00  234.80    0.00 113273.60     0.00 
964.85     3.06   12.86   12.86    0.00   2.82  66.20
sdh               0.00     0.00    0.00    0.20     0.00 0.20     
2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00 28175.00    0.00  245.80     0.00 113683.20 
925.01     0.96    3.91    0.00    3.91   2.89  71.00
sdj           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.53    6.50    6.50    0.00   2.07  48.80
md1               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00

So that has significantly increased the rebuild speed, as you would expect.


Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply

* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-30  1:53 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID; +Cc: Dan Williams
In-Reply-To: <a8ad0de2-5cec-1e70-0fff-1df911a02932@fnarfbargle.com>

[-- Attachment #1: Type: text/plain, Size: 580 bytes --]

On Thu, Mar 30 2017, Brad Campbell wrote:

>        [==============>......]  recovery = 72.0% (4222847296/5860391360) 
> finish=541.6min speed=50382K/sec
...
>        [==============>......]  recovery = 73.3% (4297641980/5860391360) 
> finish=284.7min speed=91465K/sec
>

>
> So that has significantly increased the rebuild speed, as you would expect.
>

Not quite twice as fast, but close!
Everyone has been suffering with that poor speed for years and nobody
noticed until you did.  Thanks for being thorough!  And thanks for
confirming the patch.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Martin K. Petersen @ 2017-03-30  2:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
In-Reply-To: <1490726988.2573.16.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

Bart Van Assche <Bart.VanAssche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> writes:

Hi Bart,

> A quote from SBC: "An OPTIMAL UNMAP GRANULARITY field set to a
> non-zero value indicates the optimal granularity in logical blocks for
> unmap requests (e.g., an UNMAP command or a WRITE SAME (16) command
> with the UNMAP bit set to one).  An unmap request with a number of
> logical blocks that is not a multiple of this value may result in
> unmap operations on fewer LBAs than requested."

Indeed. Fewer LBAs than requested may be *unmapped*. That does not imply
that they are not *written*.

> This means that just like the start and end of a discard must be
> aligned on a discard_granularity boundary, WRITE SAME commands with
> the UNMAP bit set must also respect that granularity. I think this
> means that either __blkdev_issue_zeroout() has to be modified such
> that it rejects unaligned REQ_OP_WRITE_ZEROES operations or that
> blk_bio_write_same_split() has to be modified such that it generates
> REQ_OP_WRITEs for the unaligned start and tail.

No, that's not correct. SBC states:

"a) if the Data-Out Buffer of the WRITE SAME command is the same as the
   logical block data returned by a read operation from that LBA while
   in the unmapped state, then:

   1) the device server performs the actions described in table 6
      [unmap]; and

   2) if an unmap operation is not performed in step 1), then the device
      server shall perform the specified write operation to that LBA;"

I.e. With WRITE SAME it is the responsibility of the device server to
write any LBAs described by the command that were not successfully
unmapped.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH] md: update slab_cache before releasing new stripes when stripes resizing
From: NeilBrown @ 2017-03-30  3:01 UTC (permalink / raw)
  To: linux-raid; +Cc: Dennis Yang
In-Reply-To: <1490773573-32692-1-git-send-email-dennisyang@qnap.com>

[-- Attachment #1: Type: text/plain, Size: 5410 bytes --]

On Wed, Mar 29 2017, Dennis Yang wrote:

> When growing raid5 device on machine with small memory, there is chance that
> mdadm will be killed and the following bug report can be observed. The same
> bug could also be reproduced in linux-4.10.6.
>
> [57600.075774] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
> [57600.114678] Oops: 0002 [#1] SMP
> [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P           O    4.2.8 #1
> [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
> [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
> [57600.204963] RIP: 0010:[<ffffffff81a6aa87>]  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.213057] RSP: 0018:ffff880043073810  EFLAGS: 00010046
> [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
> [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
> [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
> [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
> [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
> [57600.253999] FS:  00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
> [57600.262078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
> [57600.274942] Stack:
> [57600.276949]  ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
> [57600.284383]  ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
> [57600.291820]  ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
> [57600.299254] Call Trace:
> [57600.301698]  [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
> [57600.307523]  [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
> [57600.313779]  [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
> [57600.319344]  [<ffffffff81579942>] drop_one_stripe+0x62/0x90
> [57600.324915]  [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
> [57600.330563]  [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
> [57600.336650]  [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
> [57600.342039]  [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
> [57600.348210]  [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
> [57600.353959]  [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
> [57600.360303]  [<ffffffff8157a30b>] check_reshape+0x62b/0x770
> [57600.365866]  [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
> [57600.371778]  [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
> [57600.377604]  [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
> [57600.382827]  [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
> [57600.388307]  [<ffffffff81195238>] block_ioctl+0x38/0x40
> [57600.393525]  [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
> [57600.399010]  [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
> [57600.404400]  [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
> [57600.409447]  [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
> [57600.435460] RIP  [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
> [57600.441208]  RSP <ffff880043073810>
> [57600.444690] CR2: 0000000000000000
> [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
>
> The problem is that resize_stripes() releases new stripe_heads before assigning new
> slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
> after resize_stripes() starting releasing new stripes but right before new slab cache
> being assigned, it is possible that these new stripe_heads will be freed with the old
> slab_cache which was already been destoryed and that triggers this bug.
>
> Signed-off-by: Dennis Yang <dennisyang@qnap.com>
> ---
>  drivers/md/raid5.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6661db2c..172edc1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2286,6 +2286,10 @@ static int resize_stripes(struct r5conf *conf, int newsize)
>  		err = -ENOMEM;
>  
>  	mutex_unlock(&conf->cache_size_mutex);
> +	
> +	conf->slab_cache = sc;
> +	conf->active_name = 1-conf->active_name;
> +
>  	/* Step 4, return new stripes to service */
>  	while(!list_empty(&newstripes)) {
>  		nsh = list_entry(newstripes.next, struct stripe_head, lru);
> @@ -2303,8 +2307,6 @@ static int resize_stripes(struct r5conf *conf, int newsize)
>  	}
>  	/* critical section pass, GFP_NOIO no longer needed */
>  
> -	conf->slab_cache = sc;
> -	conf->active_name = 1-conf->active_name;
>  	if (!err)
>  		conf->pool_size = newsize;
>  	return err;
> -- 

Thanks!  I'd probably mark this for stable.  I suspect the bug was
introduced by  edbe83ab4c27

Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1+)
Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-30  3:09 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID; +Cc: Dan Williams
In-Reply-To: <87r31f5y3m.fsf@notabene.neil.brown.name>

On 30/03/17 09:53, NeilBrown wrote:
> On Thu, Mar 30 2017, Brad Campbell wrote:
>
>>        [==============>......]  recovery = 72.0% (4222847296/5860391360)
>> finish=541.6min speed=50382K/sec
> ...
>>        [==============>......]  recovery = 73.3% (4297641980/5860391360)
>> finish=284.7min speed=91465K/sec
>>
>
>>
>> So that has significantly increased the rebuild speed, as you would expect.
>>
>
> Not quite twice as fast, but close!

It's actually more than twice as fast.
       [================>....]  recovery = 84.1% (4933147664/5860391360) 
finish=151.4min speed=102062K/sec

That sample was taken at an inopportune moment when things were just 
getting up to speed. We're progressing towards slower parts of the disk, 
and it's still twice as fast as it was.

> Everyone has been suffering with that poor speed for years and nobody
> noticed until you did.  Thanks for being thorough!  And thanks for
> confirming the patch.

I'm baffled it has been around this long and it hasn't come up previously.

Again, I really appreciate you taking a look at this.

Regards,
Brad

^ permalink raw reply

* [PATCH] mdadm/util:integrate fstat and stat into a utility function
From: Zhilong Liu @ 2017-03-30  4:46 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

mdadm/util: lots of dupilicate codes about stat and fstat
check the block device, move them into a utility function
to make it concise.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Assemble.c    |  9 +--------
 Build.c       | 10 +---------
 Create.c      |  5 +----
 Dump.c        |  3 +--
 Grow.c        |  4 +---
 Incremental.c | 27 +++------------------------
 Manage.c      |  9 +--------
 mdadm.h       |  1 +
 super-intel.c |  9 +--------
 util.c        | 15 +++++++++++++++
 10 files changed, 26 insertions(+), 66 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 6a6a56b..d3bfaa2 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -204,14 +204,7 @@ static int select_devices(struct mddev_dev *devlist,
 				pr_err("cannot open device %s: %s\n",
 				       devname, strerror(errno));
 			tmpdev->used = 2;
-		} else if (fstat(dfd, &stb)< 0) {
-			/* Impossible! */
-			pr_err("fstat failed for %s: %s\n",
-			       devname, strerror(errno));
-			tmpdev->used = 2;
-		} else if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			pr_err("%s is not a block device.\n",
-			       devname);
+		} else if (check_block_dev(devname)) {
 			tmpdev->used = 2;
 		} else if (must_be_container(dfd)) {
 			if (st) {
diff --git a/Build.c b/Build.c
index a5fcc06..4fa606c 100644
--- a/Build.c
+++ b/Build.c
@@ -67,16 +67,8 @@ int Build(char *mddev, struct mddev_dev *devlist,
 			missing_disks++;
 			continue;
 		}
-		if (stat(dv->devname, &stb)) {
-			pr_err("Cannot find %s: %s\n",
-				dv->devname, strerror(errno));
-			return 1;
-		}
-		if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			pr_err("%s is not a block device.\n",
-				dv->devname);
+		if (check_block_dev(dv->devname))
 			return 1;
-		}
 	}
 
 	if (s->raiddisks != subdevs) {
diff --git a/Create.c b/Create.c
index 10e7d10..dbd9ce3 100644
--- a/Create.c
+++ b/Create.c
@@ -327,11 +327,8 @@ int Create(struct supertype *st, char *mddev,
 				dname, strerror(errno));
 			exit(2);
 		}
-		if (fstat(dfd, &stb) != 0 ||
-		    (stb.st_mode & S_IFMT) != S_IFBLK) {
+		if (check_block_dev(dname)) {
 			close(dfd);
-			pr_err("%s is not a block device\n",
-				dname);
 			exit(2);
 		}
 		close(dfd);
diff --git a/Dump.c b/Dump.c
index 7bdbf6f..29fc02a 100644
--- a/Dump.c
+++ b/Dump.c
@@ -131,8 +131,7 @@ int Dump_metadata(char *dev, char *dir, struct context *c,
 		if (de->d_name[0] == '.')
 			continue;
 		xasprintf(&p, "/dev/disk/by-id/%s", de->d_name);
-		if (stat(p, &stb) != 0 ||
-		    (stb.st_mode & S_IFMT) != S_IFBLK ||
+		if (check_block_dev(p) != 0 ||
 		    stb.st_rdev != dstb.st_rdev) {
 			/* Not this one */
 			free(p);
diff --git a/Grow.c b/Grow.c
index b86b53e..063996d 100755
--- a/Grow.c
+++ b/Grow.c
@@ -145,9 +145,7 @@ int Grow_Add_device(char *devname, int fd, char *newdev)
 		free(st);
 		return 1;
 	}
-	fstat(nfd, &stb);
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		pr_err("%s is not a block device!\n", newdev);
+	if (check_block_dev(newdev)) {
 		close(nfd);
 		free(st);
 		return 1;
diff --git a/Incremental.c b/Incremental.c
index 81afc7e..b2fad39 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -108,18 +108,8 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 	struct createinfo *ci = conf_get_create_info();
 
-	if (stat(devname, &stb) < 0) {
-		if (c->verbose >= 0)
-			pr_err("stat failed for %s: %s.\n",
-				devname, strerror(errno));
+	if (check_block_dev(devname))
 		return rv;
-	}
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		if (c->verbose >= 0)
-			pr_err("%s is not a block device.\n",
-				devname);
-		return rv;
-	}
 	dfd = dev_open(devname, O_RDONLY);
 	if (dfd < 0) {
 		if (c->verbose >= 0)
@@ -159,8 +149,7 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 		devlist = conf_get_devs();
 		for (;devlist; devlist = devlist->next) {
 			struct stat st2;
-			if (stat(devlist->devname, &st2) == 0 &&
-			    (st2.st_mode & S_IFMT) == S_IFBLK &&
+			if (check_block_dev(devlist->devname) == 0 &&
 			    st2.st_rdev == stb.st_rdev)
 				break;
 		}
@@ -175,18 +164,8 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 	/* 2/ Find metadata, reject if none appropriate (check
 	 *            version/name from args) */
 
-	if (fstat(dfd, &stb) < 0) {
-		if (c->verbose >= 0)
-			pr_err("fstat failed for %s: %s.\n",
-				devname, strerror(errno));
+	if (check_block_dev(devname))
 		goto out;
-	}
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		if (c->verbose >= 0)
-			pr_err("%s is not a block device.\n",
-				devname);
-		goto out;
-	}
 
 	dinfo.disk.major = major(stb.st_rdev);
 	dinfo.disk.minor = minor(stb.st_rdev);
diff --git a/Manage.c b/Manage.c
index 55218d9..8687084 100644
--- a/Manage.c
+++ b/Manage.c
@@ -1543,17 +1543,10 @@ int Manage_subdevs(char *devname, int fd,
 				close(tfd);
 			} else {
 				int open_err = errno;
-				if (stat(dv->devname, &stb) != 0) {
-					pr_err("Cannot find %s: %s\n",
-					       dv->devname, strerror(errno));
-					goto abort;
-				}
-				if ((stb.st_mode & S_IFMT) != S_IFBLK) {
+				if (check_block_dev(dv->devname) != 0) {
 					if (dv->disposition == 'M')
 						/* non-fatal. Also improbable */
 						continue;
-					pr_err("%s is not a block device.\n",
-					       dv->devname);
 					goto abort;
 				}
 				if (dv->disposition == 'r')
diff --git a/mdadm.h b/mdadm.h
index dbf1f92..ad43a35 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1416,6 +1416,7 @@ extern int parse_cluster_confirm_arg(char *inp, char **devname, int *slot);
 extern int check_ext2(int fd, char *name);
 extern int check_reiser(int fd, char *name);
 extern int check_raid(int fd, char *name);
+extern int check_block_dev(char *dev);
 extern int check_partitions(int fd, char *dname,
 			    unsigned long long freesize,
 			    unsigned long long size);
diff --git a/super-intel.c b/super-intel.c
index 785488a..21b5887 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -6584,14 +6584,7 @@ count_volumes_list(struct md_list *devlist, char *homehost,
 			dprintf("cannot open device %s: %s\n",
 				devname, strerror(errno));
 			tmpdev->used = 2;
-		} else if (fstat(dfd, &stb)< 0) {
-			/* Impossible! */
-			dprintf("fstat failed for %s: %s\n",
-				devname, strerror(errno));
-			tmpdev->used = 2;
-		} else if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			dprintf("%s is not a block device.\n",
-				devname);
+		} else if (check_block_dev(devname)) {
 			tmpdev->used = 2;
 		} else if (must_be_container(dfd)) {
 			struct supertype *cst;
diff --git a/util.c b/util.c
index 683c869..e7aeb54 100644
--- a/util.c
+++ b/util.c
@@ -750,6 +750,21 @@ int ask(char *mesg)
 }
 #endif /* MDASSEMBLE */
 
+int check_block_dev(char *dev)
+{
+	struct stat stb;
+
+	if (stat(dev, &stb) != 0) {
+		pr_err("stat failed for %s: %s\n", dev, strerror(errno));
+		return -1;
+	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return -1;
+	}
+	return 0;
+}
+
 int is_standard(char *dev, int *nump)
 {
 	/* tests if dev is a "standard" md dev name.
-- 
2.10.2


^ permalink raw reply related

* Re: [PATCH v6 0/4] Broadcom SBA RAID support
From: Vinod Koul @ 2017-03-30  4:49 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Anup Patel, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Dan Williams, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine, Device Tree,
	Linux ARM Kernel, Linux Kernel, linux-crypto, linux-raid
In-Reply-To: <CAALAos-P6CDdWTJxg8YQfT1V+iCxUcbYHiQ35EdZzz9PnN3cuA@mail.gmail.com>

On Wed, Mar 29, 2017 at 11:35:43AM +0530, Anup Patel wrote:
> On Tue, Mar 21, 2017 at 2:48 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> > On Tue, Mar 21, 2017 at 02:17:21PM +0530, Anup Patel wrote:
> >> On Tue, Mar 21, 2017 at 2:00 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> >> > On Mon, Mar 06, 2017 at 03:13:24PM +0530, Anup Patel wrote:
> >> >> The Broadcom SBA RAID is a stream-based device which provides
> >> >> RAID5/6 offload.
> >> >>
> >> >> It requires a SoC specific ring manager (such as Broadcom FlexRM
> >> >> ring manager) to provide ring-based programming interface. Due to
> >> >> this, the Broadcom SBA RAID driver (mailbox client) implements
> >> >> DMA device having one DMA channel using a set of mailbox channels
> >> >> provided by Broadcom SoC specific ring manager driver (mailbox
> >> >> controller).
> >> >>
> >> >> The Broadcom SBA RAID hardware requires PQ disk position instead
> >> >> of PQ disk coefficient. To address this, we have added raid_gflog
> >> >> table which will help driver to convert PQ disk coefficient to PQ
> >> >> disk position.
> >> >>
> >> >> This patchset is based on Linux-4.11-rc1 and depends on patchset
> >> >> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"
> >> >
> >> > Okay I applied and was about to push when I noticed this :(
> >> >
> >> > So what is the status of this..?
> >>
> >> PATCH2 is Acked but PATCH1 is under-review. Currently, its
> >> v6 of that patchset.
> >>
> >> The only dependency on that patchset is the changes in
> >> brcm-message.h which are required by this BCM-SBA-RAID
> >> driver.
> >>
> >> @Jassi,
> >> Can you please have a look at PATCH v6?
> >
> > And I would need an immutable branch/tag once merged. I am going to keep
> > this series pending till then.
> 
> The Broadcom FlexRM patchset is pickedup by Jassi and
> can be found in mailbox-for-next branch of
> git://git.linaro.org/landing-teams/working/fujitsu/integration
> 
> Both patchset (Broadcom FlexRM patchset and this one) are
> also available in sba-raid-v7 branch of
> https://github.com/Broadcom/arm64-linux.git

Jassi,

Can you provide an immutable branch/tag please for this, latter is
preferred.

Btw didn't find your tree in MAINTAINERS..

> 
> Regards,
> Anup

-- 
~Vinod

^ permalink raw reply

* [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
From: Matt Brown @ 2017-03-30  5:13 UTC (permalink / raw)
  To: linux-raid; +Cc: linuxppc-dev, dja

The raid6 Q syndrome check has been optimised using the vpermxor
instruction.  This instruction was made available with POWER8, ISA version
2.07. It allows for both vperm and vxor instructions to be done in a single
instruction. This has been tested for correctness on a ppc64le vm with a
basic RAID6 setup containing 5 drives.

The performance benchmarks are from the raid6test in the /lib/raid6/test
directory. These results are from an IBM Firestone machine with ppc64le
architecture. The benchmark results show a 35% speed increase over the best
existing algorithm for powerpc (altivec). The raid6test has also been run
on a big-endian ppc64 vm to ensure it also works for big-endian
architectures.

Performance benchmarks:

	raid6: altivecx4 gen() 18773 MB/s
	raid6: altivecx8 gen() 19438 MB/s

	raid6: vpermxor4 gen() 25112 MB/s
	raid6: vpermxor8 gen() 26279 MB/s

Note: Also fixed a small bug in pq.h regarding a missing and mismatched
ifdef statement

Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
---
 include/linux/raid/pq.h |   4 ++
 lib/raid6/Makefile      |  27 ++++++++++++-
 lib/raid6/algos.c       |   4 ++
 lib/raid6/altivec.uc    |   3 ++
 lib/raid6/test/Makefile |  28 ++++++++++++-
 lib/raid6/vpermxor.uc   | 102 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 165 insertions(+), 3 deletions(-)
 create mode 100644 lib/raid6/vpermxor.uc

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..3df9aa6 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
 extern const struct raid6_calls raid6_avx512x4;
 extern const struct raid6_calls raid6_tilegx8;
 extern const struct raid6_calls raid6_s390vx8;
+extern const struct raid6_calls raid6_vpermxor1;
+extern const struct raid6_calls raid6_vpermxor2;
+extern const struct raid6_calls raid6_vpermxor4;
+extern const struct raid6_calls raid6_vpermxor8;
 
 struct raid6_recov_calls {
 	void (*data2)(int, size_t, int, int, void **);
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3057011..7775aad 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -4,7 +4,8 @@ raid6_pq-y	+= algos.o recov.o tables.o int1.o int2.o int4.o \
 		   int8.o int16.o int32.o
 
 raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
-raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
+				vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
 raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
@@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
 $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
 	$(call if_changed,unroll)
 
+CFLAGS_vpermxor1.o += $(altivec_flags)
+targets += vpermxor1.c
+$(obj)/vpermxor1.c: UNROLL := 1
+$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_vpermxor2.o += $(altivec_flags)
+targets += vpermxor2.c
+$(obj)/vpermxor2.c: UNROLL := 2
+$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_vpermxor4.o += $(altivec_flags)
+targets += vpermxor4.c
+$(obj)/vpermxor4.c: UNROLL := 4
+$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
+CFLAGS_vpermxor8.o += $(altivec_flags)
+targets += vpermxor8.c
+$(obj)/vpermxor8.c: UNROLL := 8
+$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+	$(call if_changed,unroll)
+
 CFLAGS_neon1.o += $(NEON_FLAGS)
 targets += neon1.c
 $(obj)/neon1.c:   UNROLL := 1
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 7857049..edd4f69 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
 	&raid6_altivec2,
 	&raid6_altivec4,
 	&raid6_altivec8,
+	&raid6_vpermxor1,
+	&raid6_vpermxor2,
+	&raid6_vpermxor4,
+	&raid6_vpermxor8,
 #endif
 #if defined(CONFIG_TILEGX)
 	&raid6_tilegx8,
diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 682aae8..d20ed0d 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -24,10 +24,13 @@
 
 #include <linux/raid/pq.h>
 
+#ifdef CONFIG_ALTIVEC
+
 #include <altivec.h>
 #ifdef __KERNEL__
 # include <asm/cputable.h>
 # include <asm/switch_to.h>
+#endif /* __KERNEL__ */
 
 /*
  * This is the C data type to use.  We use a vector of
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 2c7b60e..29ebb39 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -47,13 +47,25 @@ else
                          gcc -c -x c - >&/dev/null && \
                          rm ./-.o && echo yes)
         ifeq ($(HAS_ALTIVEC),yes)
-                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+		OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
         endif
 endif
 ifeq ($(ARCH),tilegx)
 OBJS += tilegx8.o
 endif
 
+ifeq ($(ARCH),ppc64le)
+	CFLAGS += -I../../../arch/powerpc/include
+	CFLAGS += -DCONFIG_ALTIVEC
+	OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+		vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
+else ifeq ($(ARCH),ppc64)
+	CFLAGS += -I../../../arch/powerpc/include
+	CFLAGS += -DCONFIG_ALTIVEC
+	OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+		vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
+endif
+
 .c.o:
 	$(CC) $(CFLAGS) -c -o $@ $<
 
@@ -97,6 +109,18 @@ altivec4.c: altivec.uc ../unroll.awk
 altivec8.c: altivec.uc ../unroll.awk
 	$(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
 
+vpermxor1.c: vpermxor.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=1 < vpermxor.uc > $@
+
+vpermxor2.c: vpermxor.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=2 < vpermxor.uc > $@
+
+vpermxor4.c: vpermxor.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=4 < vpermxor.uc > $@
+
+vpermxor8.c: vpermxor.uc ../unroll.awk
+	$(AWK) ../unroll.awk -vN=8 < vpermxor.uc > $@
+
 int1.c: int.uc ../unroll.awk
 	$(AWK) ../unroll.awk -vN=1 < int.uc > $@
 
@@ -122,7 +146,7 @@ tables.c: mktables
 	./mktables > tables.c
 
 clean:
-	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
+	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c vpermxor*.c neon*.c tables.c raid6test
 	rm -f tilegx*.c
 
 spotless: clean
diff --git a/lib/raid6/vpermxor.uc b/lib/raid6/vpermxor.uc
new file mode 100644
index 0000000..96f48e3
--- /dev/null
+++ b/lib/raid6/vpermxor.uc
@@ -0,0 +1,102 @@
+/*
+ * Copyright 2017, Matt Brown, IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * vpermxor$#.c
+ *
+ * $#-way unrolled portable integer math RAID-6 instruction set
+ * This file is postprocessed using unroll.awk
+ *
+ * vpermxor$#.c makes use of the vpermxor opcode to optimise the RAID6 Q
+ * syndrome calculations.
+ * This can be run on systems which have both Altivec and the vpermxor opcode.
+ *
+ * This instruction was introduced in POWER8 - ISA v2.07.
+ */
+
+#include <linux/raid/pq.h>
+#ifdef CONFIG_ALTIVEC
+
+#include <altivec.h>
+#ifdef __KERNEL__
+#include <asm/cputable.h>
+#include <asm/switch_to.h>
+#endif
+
+typedef vector unsigned char unative_t;
+#define NSIZE sizeof(unative_t)
+
+static const vector unsigned char gf_low = {0x1e, 0x1c, 0x1a, 0x18, 0x16, 0x14,
+					    0x12, 0x10, 0x0e, 0x0c, 0x0a, 0x08,
+					    0x06, 0x04, 0x02,0x00};
+static const vector unsigned char gf_high = {0xfd, 0xdd, 0xbd, 0x9d, 0x7d, 0x5d,
+					     0x3d, 0x1d, 0xe0, 0xc0, 0xa0, 0x80,
+					     0x60, 0x40, 0x20, 0x00};
+
+static void noinline raid6_vpermxor$#_gen_syndrome_real(int disks, size_t bytes,
+							void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+	unative_t wp$$, wq$$, wd$$;
+
+	z0 = disks - 3;		/* Highest data disk */
+	p = dptr[z0+1];		/* XOR parity */
+	q = dptr[z0+2];		/* RS syndrome */
+
+	for (d = 0; d < bytes; d += NSIZE*$#) {
+		wp$$ = wq$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
+
+		for (z = z0-1; z>=0; z--) {
+			wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
+			/* P syndrome */
+			wp$$ = vec_xor(wp$$, wd$$);
+
+			/*Q syndrome */
+			__asm__ __volatile__("vpermxor %0,%1,%2,%3\n\t":"=v"(wq$$):"v"(gf_high), "v"(gf_low), "v"(wq$$));
+			wq$$ = vec_xor(wq$$, wd$$);
+		}
+		*(unative_t *)&p[d+NSIZE*$$] = wp$$;
+		*(unative_t *)&q[d+NSIZE*$$] = wq$$;
+	}
+}
+
+static void raid6_vpermxor$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
+{
+	preempt_disable();
+	enable_kernel_altivec();
+
+	raid6_vpermxor$#_gen_syndrome_real(disks, bytes, ptrs);
+
+	disable_kernel_altivec();
+	preempt_enable();
+}
+
+int raid6_have_altivec_vpermxor(void);
+#if $# == 1
+int raid6_have_altivec_vpermxor(void)
+{
+	/* Check if CPU has both altivec and the vpermxor instruction*/
+# ifdef __KERNEL__
+	return (cpu_has_feature(CONFIG_ALTIVEC) &&
+		cpu_has_feature(CPU_FTR_ARCH_207S));
+# else
+	return 1;
+#endif
+
+}
+#endif
+
+const struct raid6_calls raid6_vpermxor$# = {
+	raid6_vpermxor$#_gen_syndrome,
+	NULL,
+	raid6_have_altivec_vpermxor,
+	"vpermxor$#",
+	0
+};
+#endif
-- 
2.9.3


^ permalink raw reply related

* [PATCH v1] mdadm/grow: reshape would be stuck from raid1 to raid5
From: Zhilong Liu @ 2017-03-30  7:38 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <wrfjk27aid6n.fsf@gmail.com>

systemctl doesn't interpret mdadm-grow-continue@.service
correctly due to the wrong argument provided in [service],
it should be corrected %I as %i. Otherwise, if the service
cannot start by systemctl and the reshap progress would be
stuck all time when grows array from raid1 to raid5.

reproduce steps:
./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 systemd/mdadm-grow-continue@.service | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/systemd/mdadm-grow-continue@.service b/systemd/mdadm-grow-continue@.service
index 5c667d2..882bc0b 100644
--- a/systemd/mdadm-grow-continue@.service
+++ b/systemd/mdadm-grow-continue@.service
@@ -10,7 +10,7 @@ Description=Manage MD Reshape on /dev/%I
 DefaultDependencies=no
 
 [Service]
-ExecStart=BINDIR/mdadm --grow --continue /dev/%I
+ExecStart=BINDIR/mdadm --grow --continue /dev/%i
 StandardInput=null
 StandardOutput=null
 StandardError=null
-- 
2.10.2


^ permalink raw reply related

* Re: [PATCH 2/2] mdadm.c: fix compile error "switch condition has boolean value"
From: Gioh Kim @ 2017-03-30  7:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: jes.sorensen, linux-raid, linux-kernel, Wol
In-Reply-To: <87lgrnu5k4.fsf@notabene.neil.brown.name>

On Thu, Mar 30, 2017 at 08:38:35AM +1100, NeilBrown wrote:
> On Thu, Mar 30 2017, jes.sorensen@gmail.com wrote:
> 
> > Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> >> Remove a boolean expression in switch condition
> >> to prevent compile error of some compilers.
> >
> > Please be specific, which compile is unable to handle this?
> >
> >> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> >> ---
> >>  mdadm.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mdadm.c b/mdadm.c
> >> index 08ddcab..a98a051 100644
> >> --- a/mdadm.c
> >> +++ b/mdadm.c
> >> @@ -1905,11 +1905,11 @@ static int misc_list(struct mddev_dev *devlist,
> >>  			rv |= SetAction(dv->devname, c->action);
> >>  			continue;
> >>  		}
> >> -		switch(dv->devname[0] == '/') {
> >> -			case 0:
> >> +		switch(dv->devname[0]) {
> >> +			default:
> >>  				mdfd = open_dev(dv->devname);
> >>  				if (mdfd >= 0) break;
> >> -			case 1:
> >> +			case '/':
> >>  				mdfd = open_mddev(dv->devname, 1);  
> >>  		}
> >>  		if (mdfd>=0) {
> >
> > While I agree the original code is ugly, I am not convinced your
> > replacement is a lot prettier.
> >
> 
> Maybe
> 
> 		if (dv->devname[0] == '/' ||
> 		    (mdfd = open_dev(dv->devname)) < 0)
> 				mdfd = open_mddev(dv->devname, 1);
> 
> ??
> NeilBrown

Yes, the initial version I thought was:

if (dev->devname[0] != '/')
    mdfd = open_dev(dv->devname);
if (dev->devname[0] == '/' || mdfd < 0)
    mdfd = open_mddev(dv->devname, 1);

But I thought you have a reason to use switch-case expression,
so I kept switch-case.
If you agree, I'll send v2 patch using if-expression.

-- 
Best regards,
Gi-Oh Kim
TEL: 0176 2697 8962

^ permalink raw reply

* Re: [PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES
From: hch-jcswGhMUV9g @ 2017-03-30  8:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
In-Reply-To: <1490717553.2573.4.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

On Tue, Mar 28, 2017 at 04:12:46PM +0000, Bart Van Assche wrote:
> Since REQ_OP_WRITE_ZEROES was introduced in kernel v4.10, do we need
> "Cc: stable" and "Fixes: a6f0788ec2881" tags for this patch?

No.  This just works around the way scsi_setup_cmnd sets up the data
direction.  Before this series it's not an issue because no one used
the req_op data direction for setting up the dma direction.

^ permalink raw reply

* Re: [PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches
From: hch-jcswGhMUV9g @ 2017-03-30  8:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
In-Reply-To: <1490719834.2573.9.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

On Tue, Mar 28, 2017 at 04:50:47PM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> > This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
> > and preparse for removing the discard_zeroes_data flag.
> 
> Hello Christoph,
> 
> "preparse" probably should have been "prepare"?

Yes, fixed.

^ permalink raw reply

* Re: [PATCH 12/23] sd: handle REQ_UNMAP
From: hch-jcswGhMUV9g @ 2017-03-30  9:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA@public.gmane.org,
	shli-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	agk-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ@public.gmane.org
In-Reply-To: <1490719722.2573.8.camel-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

On Tue, Mar 28, 2017 at 04:48:55PM +0000, Bart Van Assche wrote:
> >  	if (sdp->no_write_same)
> >  		return BLKPREP_INVALID;
> >  	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
> 
> Users can change the provisioning mode from user space from SD_LBP_WS16 into
> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
> 0xffffffff || nr_sectors > 0xffff) check if REQ_UNMAP is set.

They can, and if the device has too many sectors that will already cause
discard to fail, and in this case it will cause write zeroes to fail as
well.  The intent behind this patch is to keep the behavior the same
as the old path that uses discards for zeroing.  The logic looks a bit
clumsy, but I'd rather keep it as-is.

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Christoph Hellwig @ 2017-03-30  9:04 UTC (permalink / raw)
  To: Christoph Hellwig, axboe-tSWWG44O7X1aa/9Udqfwiw,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	agk-H+wXaHxf7aLQT0dZR+AlfA, snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	shli-DgEjT+Ai2ygdnm+yROfE0A,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20170323155410.GD1138-w1SgEEioFePxa46PmUWvFg@public.gmane.org>

Lars, can you please take a look a patch 22 and check if it's safe?

That's the big thing I want to know before posting the next version
of the series.  If it's not safe I'd like to drop that patch.

^ permalink raw reply

* Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
From: hch @ 2017-03-30  9:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org,
	linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490720411.2573.11.camel@sandisk.com>

On Tue, Mar 28, 2017 at 05:00:48PM +0000, Bart Van Assche wrote:
> 
> It seems to me like the documentation in Documentation/ABI/testing/sysfs-block
> and the above code are not in sync. I think the above code will cause reading
> from the discard_zeroes_data attribute to return an empty string ("") instead
> of "0\n".

Thanks, fine with me.

> 
> BTW, my personal preference is to remove this attribute entirely because keeping
> it will cause confusion, no matter how well we document the behavior of this
> attribute.

Jens, any opinion?  I'd like to remove it too, but I fear it might
break things.  We could deprecate it first with a warning when read
and then remove it a few releases down the road.

^ permalink raw reply

* Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
From: Lars Ellenberg @ 2017-03-30 10:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, shli-DgEjT+Ai2ygdnm+yROfE0A,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <20170323143341.31549-23-hch-jcswGhMUV9g@public.gmane.org>

On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> It seems like DRBD assumes its on the wire TRIM request always zeroes data.
> Use that fact to implement REQ_OP_WRITE_ZEROES.
> 
> XXX: will need a careful audit from the drbd team!

Thanks, this one looks ok to me.

The real question for me is, will the previous one (21/23)
return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
what we have now?  Will it make an fstrim cause thinly provisioned
devices to suddenly be fully allocated?
Or does it unmap "the same" as what we have now?
Especially on top of dm-thin, but also on top of any other device.
That's something that is not really "obvious" to me yet.

Cheers,
    Lars

^ permalink raw reply

* Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
From: Christoph Hellwig @ 2017-03-30 11:44 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, linux-block, linux-scsi, drbd-dev, dm-devel,
	linux-raid
In-Reply-To: <20170330100641.GI5939@soda.linbit>

On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > It seems like DRBD assumes its on the wire TRIM request always zeroes data.
> > Use that fact to implement REQ_OP_WRITE_ZEROES.
> > 
> > XXX: will need a careful audit from the drbd team!
> 
> Thanks, this one looks ok to me.

So the DRBD protocol requires the TRIM operation to always zero?

> The real question for me is, will the previous one (21/23)
> return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> what we have now?

No, blkdev_issue_zeroout should never return -EOPNOTSUPP.

> Will it make an fstrim cause thinly provisioned
> devices to suddenly be fully allocated?

Not for SCSI devices.  Yes for dm-thinp until it implements
REQ_OP_WRITE_ZEROES, which will hopefully be soon.

^ permalink raw reply

* Re: [Drbd-dev] [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
From: Lars Ellenberg @ 2017-03-30 12:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, agk, snitzer, shli, philipp.reisner,
	linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170330114408.GA15777@lst.de>

On Thu, Mar 30, 2017 at 01:44:09PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> > On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > > It seems like DRBD assumes its on the wire TRIM request always zeroes data.
> > > Use that fact to implement REQ_OP_WRITE_ZEROES.
> > > 
> > > XXX: will need a careful audit from the drbd team!
> > 
> > Thanks, this one looks ok to me.
> 
> So the DRBD protocol requires the TRIM operation to always zero?

"users" (both as in submitting entities, and people using DRBD)
expect that DRBD guarantees replicas to be identical after whatever
operations have been completed by all replicas.

Which means that for trim/discard/unmap, we can only expose that to
upper layers (or use it for internal purposes) if the operation has
a well defined, and on all backends identical, result.

Short answer: Yes.

> > The real question for me is, will the previous one (21/23)
> > return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> > what we have now?
> 
> No, blkdev_issue_zeroout should never return -EOPNOTSUPP.
> 
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
> 
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

"good enough for me" ;-)

Thanks,

    Lars

^ permalink raw reply

* Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
From: Mike Snitzer @ 2017-03-30 13:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, shli-DgEjT+Ai2ygdnm+yROfE0A,
	agk-H+wXaHxf7aLQT0dZR+AlfA, drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <20170330114408.GA15777-jcswGhMUV9g@public.gmane.org>

On Thu, Mar 30 2017 at  7:44am -0400,
Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:

> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> 
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
> 
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

I can work on this now.  Only question I have is: should DM thinp take
care to zero any misaligned head and tail?  (I assume so but with all
the back and forth between Bart, Paolo and Martin I figured I'd ask
explicitly).

^ permalink raw reply

* Re: Grow set size issue
From: Jes Sorensen @ 2017-03-30 14:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87o9wju5r0.fsf@notabene.neil.brown.name>

On 03/29/2017 05:34 PM, NeilBrown wrote:
> On Thu, Mar 30 2017, jes.sorensen@gmail.com wrote:
>
>> Hi Neil,
>>
>> In the below patch you changed the error handling, to make the kernel
>> not setting the size of the device being an error. However we still have
>> the code in place to handle the error, except it never triggers.
>
> So we do.  I should have removed all of that.
> I should have just reverted
> Commit: 65a9798b58b4 ("FIX: Detect error and rollback metadata")
>
>
>>
>> Question is do you remember the reason for this change? Old kernels not
>> allowing it, are there any legitimate reasons for the kernel to refuse
>> the size change?
>
> I needed to go further back to remind myself why we do these size change
> at all.
> The command being run here is "mdadm --grow /dev/mdX --size=foo"
> which has a primary purpose of changing the component_size of the array.
> What can happen is that someone makes all the components bigger
> (E.g. with LVM) and then uses this command to set --size=max, and it
> doesn't work.  That is because md doesn't know the devices are bigger.
>
> You can tell md that devices have changes size by writing to the "size"
> attribute.
> mdadm doesn't have an option for doing that per-device - you need to
> poke into sysfs.
>
> To make it a bit easier, when you use "--grow --size=foo", mdadm will
> always write "foo" to the "size" attribute of each device, just incase
> that will be helpful. In the case where the device is now bigger, it
> will.
>
> In the case where the size of the array is being reduced, it is not
> permitted to change the "size" of each device until the "component_size"
> of the array has changed, so these attempts to change "size" will fail.
> But that isn't a problem.
>
> In short, the attempt to change "size" here is a convenience, and
> optimization.  It doesn't matter if it fails.
>
> So please just revert all bits of the above commit that are still
> present.

Hi Neil,

Thanks for the explanation, I'll take the big hammer to the leftovers 
and get rid of them.

Cheers,
Jes



^ permalink raw reply

* [PATCH] imsm: use rounded size for metadata initialization
From: Tomasz Majchrzak @ 2017-03-30 14:25 UTC (permalink / raw)
  To: linux-raid; +Cc: jes.sorensen, Tomasz Majchrzak

Array size is rounded to the nearest MB, however number of data stripes
and blocks per disk are calculated using size passed by the user. If
given size is not aligned, there is a mismatch. It's not possible to
assemble raid0 migrated to raid5 since raid5 arrays use number of data
stripes to calculate array size.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 52 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 18 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index 785488a..84dfe2b 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -264,6 +264,8 @@ struct bbm_log {
 static char *map_state_str[] = { "normal", "uninitialized", "degraded", "failed" };
 #endif
 
+#define BLOCKS_PER_KB	(1024/512)
+
 #define RAID_DISK_RESERVED_BLOCKS_IMSM_HI 2209
 
 #define GEN_MIGR_AREA_SIZE 2048 /* General Migration Copy Area size in blocks */
@@ -1324,6 +1326,19 @@ static int is_journal(struct imsm_disk *disk)
 	return (disk->status & JOURNAL_DISK) == JOURNAL_DISK;
 }
 
+/* round array size down to closest MB and ensure it splits evenly
+ * between members
+ */
+static unsigned long long round_size_to_mb(unsigned long long size, unsigned int
+					   disk_count)
+{
+	size /= disk_count;
+	size = (size >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+	size *= disk_count;
+
+	return size;
+}
+
 /* try to determine how much space is reserved for metadata from
  * the last get_extents() entry on the smallest active disk,
  * otherwise fallback to the default
@@ -3330,11 +3345,10 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 			if (used_disks > 0) {
 				array_blocks = blocks_per_member(map) *
 					used_disks;
-				/* round array size down to closest MB
-				 */
-				info->custom_array_size = (array_blocks
-						>> SECT_PER_MB_SHIFT)
-						<< SECT_PER_MB_SHIFT;
+				info->custom_array_size =
+					round_size_to_mb(array_blocks,
+							 used_disks);
+
 			}
 		}
 		case MIGR_VERIFY:
@@ -5241,6 +5255,8 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	unsigned long long array_blocks;
 	size_t size_old, size_new;
 	unsigned long long num_data_stripes;
+	unsigned int data_disks;
+	unsigned long long size_per_member;
 
 	if (super->orom && mpb->num_raid_devs >= super->orom->vpa) {
 		pr_err("This imsm-container already has the maximum of %d volumes\n", super->orom->vpa);
@@ -5317,9 +5333,11 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	strncpy((char *) dev->volume, name, MAX_RAID_SERIAL_LEN);
 	array_blocks = calc_array_size(info->level, info->raid_disks,
 					       info->layout, info->chunk_size,
-					       s->size * 2);
-	/* round array size down to closest MB */
-	array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+					       s->size * BLOCKS_PER_KB);
+	data_disks = get_data_disks(info->level, info->layout,
+				    info->raid_disks);
+	array_blocks = round_size_to_mb(array_blocks, data_disks);
+	size_per_member = array_blocks / data_disks;
 
 	dev->size_low = __cpu_to_le32((__u32) array_blocks);
 	dev->size_high = __cpu_to_le32((__u32) (array_blocks >> 32));
@@ -5331,7 +5349,9 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	vol->curr_migr_unit = 0;
 	map = get_imsm_map(dev, MAP_0);
 	set_pba_of_lba0(map, super->create_offset);
-	set_blocks_per_member(map, info_to_blocks_per_member(info, s->size));
+	set_blocks_per_member(map, info_to_blocks_per_member(info,
+							     size_per_member /
+							     BLOCKS_PER_KB));
 	map->blocks_per_strip = __cpu_to_le16(info_to_blocks_per_strip(info));
 	map->failed_disk_num = ~0;
 	if (info->level > 0)
@@ -5359,7 +5379,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 		map->num_domains = 1;
 
 	/* info->size is only int so use the 'size' parameter instead */
-	num_data_stripes = (s->size * 2) / info_to_blocks_per_strip(info);
+	num_data_stripes = size_per_member / info_to_blocks_per_strip(info);
 	num_data_stripes /= map->num_domains;
 	set_num_data_stripes(map, num_data_stripes);
 
@@ -7981,9 +8001,7 @@ static unsigned long long imsm_set_array_size(struct imsm_dev *dev,
 		array_blocks = new_size;
 	}
 
-	/* round array size down to closest MB
-	 */
-	array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
+	array_blocks = round_size_to_mb(array_blocks, used_disks);
 	dev->size_low = __cpu_to_le32((__u32)array_blocks);
 	dev->size_high = __cpu_to_le32((__u32)(array_blocks >> 32));
 
@@ -8096,11 +8114,9 @@ static int imsm_set_array_state(struct active_array *a, int consistent)
 					array_blocks =
 						blocks_per_member(map) *
 						used_disks;
-					/* round array size down to closest MB
-					 */
-					array_blocks = (array_blocks
-							>> SECT_PER_MB_SHIFT)
-						<< SECT_PER_MB_SHIFT;
+					array_blocks =
+						round_size_to_mb(array_blocks,
+								 used_disks);
 					a->info.custom_array_size = array_blocks;
 					/* encourage manager to update array
 					 * size
-- 
1.8.3.1


^ permalink raw reply related

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Mike Snitzer @ 2017-03-30 15:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, martin.petersen, agk, shli, philipp.reisner,
	lars.ellenberg, linux-block, linux-raid, dm-devel, linux-scsi,
	drbd-dev
In-Reply-To: <20170323143341.31549-1-hch@lst.de>

Would be very useful, particularly for testing, if
drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.

^ permalink raw reply

* [PATCH][RFC] dm raid: Fix NULL pointer dereference for raid1 without bitmap
From: kmeaw @ 2017-03-30 15:14 UTC (permalink / raw)
  To: linux-kernel, dm-devel, linux-raid
  Cc: agk, shli,
	Андрей Сметанин

From: Dmitry Bilunov <kmeaw@yandex-team.ru>

4257e08 introduces a bitmap resize call during preresume phase. User
can create a DM device with "raid" target configured as raid1 with no
metadata devices to hold superblock/bitmap info. It can be achieved
using the following sequence:

  truncate -s 32M /dev/shm/raid-test
  LOOP=$(losetup --show -f /dev/shm/raid-test)
  dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
  dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
  dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"

It results in

Killed
[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0
…
[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS:  00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291]  ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309]  ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329]  ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346]  raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361]  dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378]  dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391]  dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405]  ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419]  ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447]  do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459]  ? ____fput+0x9/0x10
[ 4029.115470]  ? task_work_run+0x79/0xa0
[ 4029.115481]  SyS_ioctl+0x3c/0x70
[ 4029.115493]  entry_SYSCALL_64_fastpath+0x13/0x94

Function raid_preresume takes an assumption that raid_set has a bitmap enabled.
Call to bitmap_load(&rs->md) inside __load_dirty_region_bitmap would return 0 even
if there is no bitmap present.

This patch avoids resizing an absent bitmap.

Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
---
 drivers/md/dm-raid.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index f8564d63..67a5405e 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3727,6 +3727,7 @@ static int raid_preresume(struct dm_target *ti)
 
 	/* Resize bitmap to adjust to changed region size (aka MD bitmap chunksize) */
 	if (test_bit(RT_FLAG_RS_BITMAP_LOADED, &rs->runtime_flags) &&
+	    mddev->bitmap &&
 	    mddev->bitmap_info.chunksize != to_bytes(rs->requested_bitmap_chunk_sectors)) {
 		r = bitmap_resize(mddev->bitmap, mddev->dev_sectors,
 				  to_bytes(rs->requested_bitmap_chunk_sectors), 0);
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
From: Martin K. Petersen @ 2017-03-30 15:20 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	martin.petersen-QHcLZuEGTsvQT0dZR+AlfA,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	linux-block-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, shli-DgEjT+Ai2ygdnm+yROfE0A,
	Christoph Hellwig, agk-H+wXaHxf7aLQT0dZR+AlfA,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <20170330134957.GA508-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Mike Snitzer <snitzer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> I can work on this now.  Only question I have is: should DM thinp take
> care to zero any misaligned head and tail?  (I assume so but with all
> the back and forth between Bart, Paolo and Martin I figured I'd ask
> explicitly).

Yep, let's make sure our semantics match the hardware ditto.

 - So write zeroes should behave deterministically and explicitly handle
   any blocks that can't be cleared via deprovisioning.

 - And discard can work at the discard granularity in a
   non-deterministic fashion.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Martin K. Petersen @ 2017-03-30 15:22 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Christoph Hellwig, axboe, martin.petersen, agk, shli,
	philipp.reisner, lars.ellenberg, linux-block, linux-raid,
	dm-devel, linux-scsi, drbd-dev
In-Reply-To: <20170330151257.GA910@redhat.com>

Mike Snitzer <snitzer@redhat.com> writes:

> Would be very useful, particularly for testing, if
> drivers/scsi/scsi_debug.c were updated to support WRITE ZEROES.

There is no WRITE ZEROES in SCSI. You should be able to get the right
behavior with lbpws=1 lbprz=1.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox