Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-29  8:12 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID
In-Reply-To: <87zig467z3.fsf@notabene.neil.brown.name>

On 29/03/17 12:08, NeilBrown wrote:

>
> sdj is getting twice the utilization of the others but only about 10%
> more rKB/sec.  That suggests lots of seeking.

Yes, something is not entirely sequential.

> Does "fuser /dev/sdj" report anything funny?

No. No output. As far as I can tell nothing should be touching the disks 
other than the md kernel thread.

> Is there filesystem IO happening? If so, what filesystem?
> Have you told the filesystem about the RAID layout?
> Maybe the filesystem keeps reading some index blocks that are always on
> the same drive.

No. I probably wasn't as clear as I should have been in the initial 
post. There was nothing mounted at the time.

Right now the array contains one large LUKS container (dm-crypt). This 
was mounted and a continuous dd done to the dm device to zero it out :

4111195+1 records in
4111195+1 records out
34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s

So there is no filesystem on the drive.

I failed and removed sdi :

root@test:~# mdadm --fail /dev/md0 /dev/sdi
mdadm: set /dev/sdi faulty in /dev/md0
root@test:~# mdadm --remove /dev/md0 /dev/sdi
mdadm: hot removed /dev/sdi from /dev/md0
root@test:~# mdadm --zero-superblock /dev/sdi
root@test:~# mdadm --add /dev/md0 /dev/sdi
mdadm: added /dev/sdi

Rebooted the machine to remove all tweaks of things like stripe cache 
size, readahead, NCQ and anything else.

I opened the LUKS container, dd'd a meg to the start to write to the 
array and kick off the resync, then closed the LUKS container. At this 
point dm should no longer be touching the drive and I've verified the 
device has gone.

I then ran sync a couple of times and waited a couple of minutes until I 
was positive _nothing_ was touching md0, then ran :

blktrace -w 5 /dev/sd[bcdefgij] /dev/md0

> So the problem moves from drive to drive?  Strongly suggests filesystem
> metadata access to me.

Again, sorry for me not being clear. The situation changes on a resync 
specific basis. For example the reproduction I've done now I popped out 
sdi rather than sdb, and now the bottleneck is sdg. It is the same if 
the exact circumstances remain the same.

>
> If you can capture several seconds of trace on all drives plus the
> array, compress it and host it somewhere, I can pick it up and have
> look.

I've captured 5 seconds. I was overly optimistic initially and tried 20 
seconds, but that resulted in 100M bzipped.

This is 19M. The kernel is a clean 4.10.6 compiled up half an hour ago 
as I forgot to include block tracing in there.

http://www.fnarfbargle.com/private/170329-Resync/resync-blktrace.tar.bz2

root@test:~/bench# mdadm --detail /dev/md0
/dev/md0:
         Version : 1.2
   Creation Time : Wed Mar 22 14:01:41 2017
      Raid Level : raid6
      Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
   Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
    Raid Devices : 8
   Total Devices : 8
     Persistence : Superblock is persistent

     Update Time : Wed Mar 29 16:08:20 2017
           State : clean, degraded, recovering
  Active Devices : 7
Working Devices : 8
  Failed Devices : 0
   Spare Devices : 1

          Layout : left-symmetric
      Chunk Size : 64K

  Rebuild Status : 2% complete

            Name : test:0  (local to host test)
            UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
          Events : 715

     Number   Major   Minor   RaidDevice State
        8       8       16        0      active sync   /dev/sdb
        1       8       32        1      active sync   /dev/sdc
        2       8       48        2      active sync   /dev/sdd
        3       8       64        3      active sync   /dev/sde
        4       8       80        4      active sync   /dev/sdf
        5       8       96        5      active sync   /dev/sdg
        9       8      128        6      spare rebuilding   /dev/sdi
        7       8      144        7      active sync   /dev/sdj


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    4.77    4.75    0.00   90.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.40    0.00    3.00     0.00    10.60 
7.07     0.01    4.67    0.00    4.67   4.67   1.40
sdb           14432.20     0.00  120.20    0.00 57295.20     0.00 
953.33     2.04   16.71   16.71    0.00   3.23  38.80
sdc           14432.20     0.00  120.20    0.00 57295.20     0.00 
953.33     1.99   16.26   16.26    0.00   3.21  38.60
sdd           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.14   17.55   17.55    0.00   3.18  38.40
sde           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.02   16.57   16.57    0.00   3.20  38.60
sdf           14432.20     0.00  120.80    0.00 57602.40     0.00 
953.68     2.08   17.04   17.04    0.00   3.25  39.20
sdg           16135.40     0.00  224.60    0.00 65811.20     0.00 
586.03   126.46  575.26  575.26    0.00   4.45 100.00
sdh               0.00     1.40    0.00    3.00     0.00    10.60 
7.07     0.02    6.67    0.00    6.67   6.67   2.00
sdi               0.00 14982.60    0.00  123.20     0.00 59801.60 
970.81     4.52   35.57    0.00   35.57   3.25  40.00
sdj           14432.40     0.00  120.80    0.00 57603.20     0.00 
953.70     1.84   15.07   15.07    0.00   3.06  37.00
md1               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.20     0.00     8.80 
8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00 
0.00     0.00    0.00    0.00    0.00   0.00   0.00


Thanks for having a look. It seems odd to me, but I can't figure it out.

Regards,
Brad

^ permalink raw reply

* [PATCH 1/2] super1: replace hard-coded values with bit definitions
From: Gioh Kim @ 2017-03-29  9:40 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel, Gioh Kim

Some hard-coded values for disk status are replaced
with bit definitions.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
---
 super1.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/super1.c b/super1.c
index f3520ac..e8b9f61 100644
--- a/super1.c
+++ b/super1.c
@@ -1010,7 +1010,7 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 		info->disk.state = 0; /* spare: not active, not sync, not faulty */
 		break;
 	case MD_DISK_ROLE_FAULTY:
-		info->disk.state = 1; /* faulty */
+		info->disk.state = (1 << MD_DISK_FAULTY); /* faulty */
 		break;
 	case MD_DISK_ROLE_JOURNAL:
 		info->disk.state = (1 << MD_DISK_JOURNAL);
@@ -1503,11 +1503,12 @@ static int add_to_super1(struct supertype *st, mdu_disk_info_t *dk,
 	}
 
 	dk_state = dk->state & ~(1<<MD_DISK_FAILFAST);
-	if ((dk_state & 6) == 6) /* active, sync */
+	if ((dk_state & (1<<MD_DISK_ACTIVE)) &&
+	    (dk_state & (1<<MD_DISK_SYNC)))/* active, sync */
 		*rp = __cpu_to_le16(dk->raid_disk);
 	else if (dk_state & (1<<MD_DISK_JOURNAL))
                 *rp = MD_DISK_ROLE_JOURNAL;
-	else if ((dk_state & ~2) == 0) /* active or idle -> spare */
+	else if ((dk_state & ~(1<<MD_DISK_ACTIVE)) == 0) /* active or idle -> spare */
 		*rp = MD_DISK_ROLE_SPARE;
 	else
 		*rp = MD_DISK_ROLE_FAULTY;
-- 
2.5.0


^ permalink raw reply related

* [PATCH 2/2] mdadm.c: fix compile error "switch condition has boolean value"
From: Gioh Kim @ 2017-03-29  9:40 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel, Gioh Kim
In-Reply-To: <1490780434-8720-1-git-send-email-gi-oh.kim@profitbricks.com>

Remove a boolean expression in switch condition
to prevent compile error of some compilers.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
---
 mdadm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mdadm.c b/mdadm.c
index 08ddcab..a98a051 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1905,11 +1905,11 @@ static int misc_list(struct mddev_dev *devlist,
 			rv |= SetAction(dv->devname, c->action);
 			continue;
 		}
-		switch(dv->devname[0] == '/') {
-			case 0:
+		switch(dv->devname[0]) {
+			default:
 				mdfd = open_dev(dv->devname);
 				if (mdfd >= 0) break;
-			case 1:
+			case '/':
 				mdfd = open_mddev(dv->devname, 1);  
 		}
 		if (mdfd>=0) {
-- 
2.5.0


^ permalink raw reply related

* [PATCH v4 0/6] mdadm support for Partial Parity Log
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz

This is the mdadm part of the PPL functionality. It adds a new parameter to
mdadm to allow selecting the consistency policy for an array and implements PPL
as one of the policies.

All of this is currently targeted and tested with imsm and super1 metadata
arrays. The kernel patches are available on Shaohua's md tree on the for-next
branch.

v4:
- Rebased to latest mdadm code.
- Fixed a mistake in Makefile.
- Added descriptions for --consistency-policy in --help messages.

v3:
- Rebased to latest mdadm code.
- Replaced --rwh-policy with --consistency-policy.
- Adjusted to the kernel interface changes.
- Write the initial ppl header in mdadm when creating an array.
- Validate ppl when starting an imsm array. 
- Support for --update.
- Support also 1.0 metadata.
- Use Grow mode for enabing or disabling ppl for an active array.

v2:
- Rebased to latest mdadm code.
- Fixed runtime policy switching.
- Recognize IMSM PPL journal drive correctly.
- Removed bitfields from imsm_dev.
- Made dirty/consistent checks in imsm_set_array_state() more readable.
- Added comment about calculating sb_start.

Artur Paszkiewicz (6):
  Generic support for --consistency-policy and PPL
  Detail: show consistency policy
  imsm: PPL support
  super1: PPL support
  Add 'ppl' and 'no-ppl' options for --update=
  Grow: support consistency policy change

 Assemble.c    |  58 ++++++++++
 Create.c      |  20 +++-
 Detail.c      |  90 +++++++++------
 Grow.c        | 187 ++++++++++++++++++++++++++++++-
 Incremental.c |   3 +
 Kill.c        |   2 +-
 Makefile      |   5 +-
 ReadMe.c      |  97 ++++++++--------
 maps.c        |  10 ++
 md_p.h        |  25 +++++
 mdadm.8.in    |  85 ++++++++++++--
 mdadm.c       |  64 ++++++++++-
 mdadm.h       |  30 ++++-
 super-ddf.c   |  12 +-
 super-gpt.c   |   2 +-
 super-intel.c | 347 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 super-mbr.c   |   2 +-
 super0.c      |  12 +-
 super1.c      | 214 +++++++++++++++++++++++++++++++-----
 sysfs.c       |  25 +++++
 20 files changed, 1116 insertions(+), 174 deletions(-)

-- 
2.11.0


^ permalink raw reply

* [PATCH v4 1/6] Generic support for --consistency-policy and PPL
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Add a new parameter to mdadm: --consistency-policy=. It determines how
the array maintains consistency in case of unexpected shutdown. This
maps to the md sysfs attribute 'consistency_policy'. It can be used to
create a raid5 array using PPL. Add the necessary plumbing to pass this
option to metadata handlers. The write journal and bitmap
functionalities are treated as different policies, which are implicitly
selected when using --write-journal or --bitmap options.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Create.c      | 18 ++++++++++++++----
 Kill.c        |  2 +-
 ReadMe.c      | 51 +++++++++++++++++++++++++++------------------------
 maps.c        | 10 ++++++++++
 mdadm.8.in    | 40 +++++++++++++++++++++++++++++++++++++---
 mdadm.c       | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 mdadm.h       | 21 ++++++++++++++++++---
 super-ddf.c   |  6 +++---
 super-gpt.c   |  2 +-
 super-intel.c | 16 ++++++++--------
 super-mbr.c   |  2 +-
 super0.c      |  8 ++++----
 super1.c      |  6 +++---
 sysfs.c       | 11 +++++++++++
 14 files changed, 190 insertions(+), 58 deletions(-)

diff --git a/Create.c b/Create.c
index 2721884e..4080bf69 100644
--- a/Create.c
+++ b/Create.c
@@ -259,7 +259,8 @@ int Create(struct supertype *st, char *mddev,
 	if (st && ! st->ss->validate_geometry(st, s->level, s->layout, s->raiddisks,
 					      &s->chunk, s->size*2,
 					      data_offset, NULL,
-					      &newsize, c->verbose>=0))
+					      &newsize, s->consistency_policy,
+					      c->verbose>=0))
 		return 1;
 
 	if (s->chunk && s->chunk != UnSet) {
@@ -358,7 +359,8 @@ int Create(struct supertype *st, char *mddev,
 						st, s->level, s->layout, s->raiddisks,
 						&s->chunk, s->size*2,
 						dv->data_offset, dname,
-						&freesize, c->verbose > 0)) {
+						&freesize, s->consistency_policy,
+						c->verbose > 0)) {
 				case -1: /* Not valid, message printed, and not
 					  * worth checking any further */
 					exit(2);
@@ -395,6 +397,7 @@ int Create(struct supertype *st, char *mddev,
 						       &s->chunk, s->size*2,
 						       dv->data_offset,
 						       dname, &freesize,
+						       s->consistency_policy,
 						       c->verbose >= 0)) {
 
 				pr_err("%s is not suitable for this array.\n",
@@ -501,7 +504,8 @@ int Create(struct supertype *st, char *mddev,
 						       s->raiddisks,
 						       &s->chunk, minsize*2,
 						       data_offset,
-						       NULL, NULL, 0)) {
+						       NULL, NULL,
+						       s->consistency_policy, 0)) {
 				pr_err("devices too large for RAID level %d\n", s->level);
 				return 1;
 			}
@@ -528,6 +532,12 @@ int Create(struct supertype *st, char *mddev,
 	if (s->bitmap_file && strcmp(s->bitmap_file, "none") == 0)
 		s->bitmap_file = NULL;
 
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    !st->ss->write_init_ppl) {
+		pr_err("%s metadata does not support PPL\n", st->ss->name);
+		return 1;
+	}
+
 	if (!have_container && s->level > 0 && ((maxsize-s->size)*100 > maxsize)) {
 		if (c->runstop != 1 || c->verbose >= 0)
 			pr_err("largest drive (%s) exceeds size (%lluK) by more than 1%%\n",
@@ -720,7 +730,7 @@ int Create(struct supertype *st, char *mddev,
 				name += 2;
 		}
 	}
-	if (!st->ss->init_super(st, &info.array, s->size, name, c->homehost, uuid,
+	if (!st->ss->init_super(st, &info.array, s, name, c->homehost, uuid,
 				data_offset))
 		goto abort_locked;
 
diff --git a/Kill.c b/Kill.c
index f2fdb856..ff52561d 100644
--- a/Kill.c
+++ b/Kill.c
@@ -63,7 +63,7 @@ int Kill(char *dev, struct supertype *st, int force, int verbose, int noexcl)
 	rv = st->ss->load_super(st, fd, dev);
 	if (rv == 0 || (force && rv >= 2)) {
 		st->ss->free_super(st);
-		st->ss->init_super(st, NULL, 0, "", NULL, NULL,
+		st->ss->init_super(st, NULL, NULL, "", NULL, NULL,
 				   INVALID_SECTORS);
 		if (st->ss->store_super(st, fd)) {
 			if (verbose >= 0)
diff --git a/ReadMe.c b/ReadMe.c
index 50d38076..fc04c2c3 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -78,11 +78,11 @@ char Version[] = "mdadm - v" VERSION " - " VERS_DATE "\n";
  *     found, it is started.
  */
 
-char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:";
+char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_bitmap_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_bitmap_auto_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:k:";
 
 struct option long_options[] = {
     {"manage",    0, 0, ManageOpt},
@@ -148,6 +148,7 @@ struct option long_options[] = {
     {"nodes",1, 0, Nodes}, /* also for --assemble */
     {"home-cluster",1, 0, ClusterName},
     {"write-journal",1, 0, WriteJournal},
+    {"consistency-policy", 1, 0, 'k'},
 
     /* For assemble */
     {"uuid",      1, 0, 'u'},
@@ -362,27 +363,29 @@ char Help_create[] =
 " other levels.\n"
 "\n"
 " Options that are valid with --create (-C) are:\n"
-"  --bitmap=          : Create a bitmap for the array with the given filename\n"
-"                     : or an internal bitmap is 'internal' is given\n"
-"  --chunk=      -c   : chunk size in kibibytes\n"
-"  --rounding=        : rounding factor for linear array (==chunk size)\n"
-"  --level=      -l   : raid level: 0,1,4,5,6,10,linear,multipath and synonyms\n"
-"  --parity=     -p   : raid5/6 parity algorithm: {left,right}-{,a}symmetric\n"
-"  --layout=          : same as --parity, for RAID10: [fno]NN \n"
-"  --raid-devices= -n : number of active devices in array\n"
-"  --spare-devices= -x: number of spare (eXtra) devices in initial array\n"
-"  --size=       -z   : Size (in K) of each drive in RAID1/4/5/6/10 - optional\n"
-"  --data-offset=     : Space to leave between start of device and start\n"
-"                     : of array data.\n"
-"  --force       -f   : Honour devices as listed on command line.  Don't\n"
-"                     : insert a missing drive for RAID5.\n"
-"  --run         -R   : insist of running the array even if not all\n"
-"                     : devices are present or some look odd.\n"
-"  --readonly    -o   : start the array readonly - not supported yet.\n"
-"  --name=       -N   : Textual name for array - max 32 characters\n"
-"  --bitmap-chunk=    : bitmap chunksize in Kilobytes.\n"
-"  --delay=      -d   : bitmap update delay in seconds.\n"
-"  --write-journal=   : Specify journal device for RAID-4/5/6 array\n"
+"  --bitmap=          -b : Create a bitmap for the array with the given filename\n"
+"                        : or an internal bitmap if 'internal' is given\n"
+"  --chunk=           -c : chunk size in kibibytes\n"
+"  --rounding=           : rounding factor for linear array (==chunk size)\n"
+"  --level=           -l : raid level: 0,1,4,5,6,10,linear,multipath and synonyms\n"
+"  --parity=          -p : raid5/6 parity algorithm: {left,right}-{,a}symmetric\n"
+"  --layout=             : same as --parity, for RAID10: [fno]NN \n"
+"  --raid-devices=    -n : number of active devices in array\n"
+"  --spare-devices=   -x : number of spare (eXtra) devices in initial array\n"
+"  --size=            -z : Size (in K) of each drive in RAID1/4/5/6/10 - optional\n"
+"  --data-offset=        : Space to leave between start of device and start\n"
+"                        : of array data.\n"
+"  --force            -f : Honour devices as listed on command line.  Don't\n"
+"                        : insert a missing drive for RAID5.\n"
+"  --run              -R : insist of running the array even if not all\n"
+"                        : devices are present or some look odd.\n"
+"  --readonly         -o : start the array readonly - not supported yet.\n"
+"  --name=            -N : Textual name for array - max 32 characters\n"
+"  --bitmap-chunk=       : bitmap chunksize in Kilobytes.\n"
+"  --delay=           -d : bitmap update delay in seconds.\n"
+"  --write-journal=      : Specify journal device for RAID-4/5/6 array\n"
+"  --consistency-policy= : Specify the policy that determines how the array\n"
+"                     -k : maintains consistency in case of unexpected shutdown.\n"
 "\n"
 ;
 
diff --git a/maps.c b/maps.c
index 64f1df2c..d9ee7de4 100644
--- a/maps.c
+++ b/maps.c
@@ -129,6 +129,16 @@ mapping_t faultylayout[] = {
 	{ NULL, 0}
 };
 
+mapping_t consistency_policies[] = {
+	{ "unknown", CONSISTENCY_POLICY_UNKNOWN},
+	{ "none", CONSISTENCY_POLICY_NONE},
+	{ "resync", CONSISTENCY_POLICY_RESYNC},
+	{ "bitmap", CONSISTENCY_POLICY_BITMAP},
+	{ "journal", CONSISTENCY_POLICY_JOURNAL},
+	{ "ppl", CONSISTENCY_POLICY_PPL},
+	{ NULL, 0}
+};
+
 char *map_num(mapping_t *map, int num)
 {
 	while (map->name) {
diff --git a/mdadm.8.in b/mdadm.8.in
index df1d460d..cad5db53 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -724,7 +724,9 @@ When creating an array on devices which are 100G or larger,
 .I mdadm
 automatically adds an internal bitmap as it will usually be
 beneficial.  This can be suppressed with
-.B "\-\-bitmap=none".
+.B "\-\-bitmap=none"
+or by selecting a different consistency policy with
+.BR \-\-consistency\-policy .
 
 .TP
 .BR \-\-bitmap\-chunk=
@@ -1020,6 +1022,36 @@ should be a SSD with reasonable lifetime.
 Auto creation of symlinks in /dev to /dev/md, option --symlinks must
 be 'no' or 'yes' and work with --create and --build.
 
+.TP
+.BR \-k ", " \-\-consistency\-policy=
+Specify how the array maintains consistency in case of unexpected shutdown.
+Only relevant for RAID levels with redundancy.
+Currently supported options are:
+.RS
+
+.TP
+.B resync
+Full resync is performed and all redundancy is regenerated when the array is
+started after unclean shutdown.
+
+.TP
+.B bitmap
+Resync assisted by a write-intent bitmap. Implicitly selected when using
+.BR \-\-bitmap .
+
+.TP
+.B journal
+For RAID levels 4/5/6, journal device is used to log transactions and replay
+after unclean shutdown. Implicitly selected when using
+.BR \-\-write\-journal .
+
+.TP
+.B ppl
+For RAID5 only, Partial Parity Log is used to close the write hole and
+eliminate resync. PPL is stored in the metadata region of RAID member drives,
+no additional journal drive is needed.
+.RE
+
 
 .SH For assemble:
 
@@ -2153,8 +2185,10 @@ in the array exceed 100G is size, an internal write-intent bitmap
 will automatically be added unless some other option is explicitly
 requested with the
 .B \-\-bitmap
-option.  In any case space for a bitmap will be reserved so that one
-can be added layer with
+option or a different consistency policy is selected with the
+.B \-\-consistency\-policy
+option. In any case space for a bitmap will be reserved so that one
+can be added later with
 .BR "\-\-grow \-\-bitmap=internal" .
 
 If the metadata type supports it (currently only 1.x metadata), space
diff --git a/mdadm.c b/mdadm.c
index 08ddcabc..d4e82866 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -78,6 +78,7 @@ int main(int argc, char *argv[])
 		.level		= UnSet,
 		.layout		= UnSet,
 		.bitmap_chunk	= UnSet,
+		.consistency_policy	= UnSet,
 	};
 
 	char sys_hostname[256];
@@ -1215,6 +1216,16 @@ int main(int argc, char *argv[])
 
 			s.journaldisks = 1;
 			continue;
+		case O(CREATE, 'k'):
+			s.consistency_policy = map_name(consistency_policies,
+							optarg);
+			if (s.consistency_policy == UnSet ||
+			    s.consistency_policy < CONSISTENCY_POLICY_RESYNC) {
+				pr_err("Invalid consistency policy: %s\n",
+				       optarg);
+				exit(2);
+			}
+			continue;
 		}
 		/* We have now processed all the valid options. Anything else is
 		 * an error
@@ -1242,9 +1253,47 @@ int main(int argc, char *argv[])
 		exit(0);
 	}
 
-	if (s.journaldisks && (s.level < 4 || s.level > 6)) {
-		pr_err("--write-journal is only supported for RAID level 4/5/6.\n");
-		exit(2);
+	if (s.journaldisks) {
+		if (s.level < 4 || s.level > 6) {
+			pr_err("--write-journal is only supported for RAID level 4/5/6.\n");
+			exit(2);
+		}
+		if (s.consistency_policy != UnSet &&
+		    s.consistency_policy != CONSISTENCY_POLICY_JOURNAL) {
+			pr_err("--write-journal is not supported with consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		}
+	}
+
+	if (mode == CREATE && s.consistency_policy != UnSet) {
+		if (s.level <= 0) {
+			pr_err("--consistency-policy not meaningful with level %s.\n",
+			       map_num(pers, s.level));
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_JOURNAL &&
+			   !s.journaldisks) {
+			pr_err("--write-journal is required for consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_PPL &&
+			   s.level != 5) {
+			pr_err("PPL consistency policy is only supported for RAID level 5.\n");
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_BITMAP &&
+			   (!s.bitmap_file ||
+			    strcmp(s.bitmap_file, "none") == 0)) {
+			pr_err("--bitmap is required for consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		} else if (s.bitmap_file &&
+			   strcmp(s.bitmap_file, "none") != 0 &&
+			   s.consistency_policy != CONSISTENCY_POLICY_BITMAP &&
+			   s.consistency_policy != CONSISTENCY_POLICY_JOURNAL) {
+			pr_err("--bitmap is not compatible with consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		}
 	}
 
 	if (!mode && devs_found) {
diff --git a/mdadm.h b/mdadm.h
index cebc0c0a..b52d4d30 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -279,6 +279,15 @@ struct mdinfo {
 	int journal_device_required;
 	int journal_clean;
 
+	enum {
+		CONSISTENCY_POLICY_UNKNOWN,
+		CONSISTENCY_POLICY_NONE,
+		CONSISTENCY_POLICY_RESYNC,
+		CONSISTENCY_POLICY_BITMAP,
+		CONSISTENCY_POLICY_JOURNAL,
+		CONSISTENCY_POLICY_PPL,
+	} consistency_policy;
+
 	/* During reshape we can sometimes change the data_offset to avoid
 	 * over-writing still-valid data.  We need to know if there is space.
 	 * So getinfo_super will fill in space_before and space_after in sectors.
@@ -426,6 +435,7 @@ enum special_options {
 	ClusterName,
 	ClusterConfirm,
 	WriteJournal,
+	ConsistencyPolicy,
 };
 
 enum prefix_standard {
@@ -527,6 +537,7 @@ struct shape {
 	int	assume_clean;
 	int	write_behind;
 	unsigned long long size;
+	int	consistency_policy;
 };
 
 /* List of device names - wildcards expanded */
@@ -618,6 +629,7 @@ enum sysfs_read_flags {
 	GET_STATE	= (1 << 23),
 	GET_ERROR	= (1 << 24),
 	GET_ARRAY_STATE = (1 << 25),
+	GET_CONSISTENCY_POLICY	= (1 << 26),
 };
 
 /* If fd >= 0, get the array it is open on,
@@ -701,7 +713,7 @@ extern int restore_stripes(int *dest, unsigned long long *offsets,
 
 extern char *map_num(mapping_t *map, int num);
 extern int map_name(mapping_t *map, char *name);
-extern mapping_t r5layout[], r6layout[], pers[], modes[], faultylayout[];
+extern mapping_t r5layout[], r6layout[], pers[], modes[], faultylayout[], consistency_policies[];
 
 extern char *map_dev_preferred(int major, int minor, int create,
 			       char *prefer);
@@ -863,7 +875,7 @@ extern struct superswitch {
 	 * metadata.
 	 */
 	int (*init_super)(struct supertype *st, mdu_array_info_t *info,
-			  unsigned long long size, char *name,
+			  struct shape *s, char *name,
 			  char *homehost, int *uuid,
 			  unsigned long long data_offset);
 
@@ -961,7 +973,7 @@ extern struct superswitch {
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *subdev, unsigned long long *freesize,
-				 int verbose);
+				 int consistency_policy, int verbose);
 
 	/* Return a linked list of 'mdinfo' structures for all arrays
 	 * in the container.  For non-containers, it is like
@@ -1059,6 +1071,9 @@ extern struct superswitch {
 	/* validate container after assemble */
 	int (*validate_container)(struct mdinfo *info);
 
+	/* write initial empty PPL on device */
+	int (*write_init_ppl)(struct supertype *st, struct mdinfo *info, int fd);
+
 	/* records new bad block in metadata */
 	int (*record_bad_block)(struct active_array *a, int n,
 					unsigned long long sector, int length);
diff --git a/super-ddf.c b/super-ddf.c
index 1707ad1e..cdd16a47 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -2290,7 +2290,7 @@ static unsigned int find_vde_by_guid(const struct ddf_super *ddf,
 
 static int init_super_ddf(struct supertype *st,
 			  mdu_array_info_t *info,
-			  unsigned long long size, char *name, char *homehost,
+			  struct shape *s, char *name, char *homehost,
 			  int *uuid, unsigned long long data_offset)
 {
 	/* This is primarily called by Create when creating a new array.
@@ -2328,7 +2328,7 @@ static int init_super_ddf(struct supertype *st,
 	struct virtual_disk *vd;
 
 	if (st->sb)
-		return init_super_ddf_bvd(st, info, size, name, homehost, uuid,
+		return init_super_ddf_bvd(st, info, s->size, name, homehost, uuid,
 					  data_offset);
 
 	if (posix_memalign((void**)&ddf, 512, sizeof(*ddf)) != 0) {
@@ -3347,7 +3347,7 @@ static int validate_geometry_ddf(struct supertype *st,
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *dev, unsigned long long *freesize,
-				 int verbose)
+				 int consistency_policy, int verbose)
 {
 	int fd;
 	struct mdinfo *sra;
diff --git a/super-gpt.c b/super-gpt.c
index 8b080a05..bb38a975 100644
--- a/super-gpt.c
+++ b/super-gpt.c
@@ -205,7 +205,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int consistency_policy, int verbose)
 {
 	pr_err("gpt metadata cannot be used this way\n");
 	return 0;
diff --git a/super-intel.c b/super-intel.c
index e1618f1c..5d0f131f 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5155,7 +5155,7 @@ static int check_name(struct intel_super *super, char *name, int quiet)
 }
 
 static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
-				  unsigned long long size, char *name,
+				  struct shape *s, char *name,
 				  char *homehost, int *uuid,
 				  long long data_offset)
 {
@@ -5250,7 +5250,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	strncpy((char *) dev->volume, name, MAX_RAID_SERIAL_LEN);
 	array_blocks = calc_array_size(info->level, info->raid_disks,
 					       info->layout, info->chunk_size,
-					       size * 2);
+					       s->size * 2);
 	/* round array size down to closest MB */
 	array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
 
@@ -5264,7 +5264,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	vol->curr_migr_unit = 0;
 	map = get_imsm_map(dev, MAP_0);
 	set_pba_of_lba0(map, super->create_offset);
-	set_blocks_per_member(map, info_to_blocks_per_member(info, size));
+	set_blocks_per_member(map, info_to_blocks_per_member(info, s->size));
 	map->blocks_per_strip = __cpu_to_le16(info_to_blocks_per_strip(info));
 	map->failed_disk_num = ~0;
 	if (info->level > 0)
@@ -5292,7 +5292,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 		map->num_domains = 1;
 
 	/* info->size is only int so use the 'size' parameter instead */
-	num_data_stripes = (size * 2) / info_to_blocks_per_strip(info);
+	num_data_stripes = (s->size * 2) / info_to_blocks_per_strip(info);
 	num_data_stripes /= map->num_domains;
 	set_num_data_stripes(map, num_data_stripes);
 
@@ -5314,7 +5314,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 }
 
 static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
-			   unsigned long long size, char *name,
+		           struct shape *s, char *name,
 			   char *homehost, int *uuid,
 			   unsigned long long data_offset)
 {
@@ -5337,7 +5337,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 	}
 
 	if (st->sb)
-		return init_super_imsm_volume(st, info, size, name, homehost, uuid,
+		return init_super_imsm_volume(st, info, s, name, homehost, uuid,
 					      data_offset);
 
 	if (info)
@@ -6914,7 +6914,7 @@ static int validate_geometry_imsm(struct supertype *st, int level, int layout,
 				  int raiddisks, int *chunk, unsigned long long size,
 				  unsigned long long data_offset,
 				  char *dev, unsigned long long *freesize,
-				  int verbose)
+				  int consistency_policy, int verbose)
 {
 	int fd, cfd;
 	struct mdinfo *sra;
@@ -10953,7 +10953,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 				    geo->raid_disks + devNumChange,
 				    &chunk,
 				    geo->size, INVALID_SECTORS,
-				    0, 0, 1))
+				    0, 0, info.consistency_policy, 1))
 		change = -1;
 
 	if (check_devs) {
diff --git a/super-mbr.c b/super-mbr.c
index f5e4ceab..1bbe57a9 100644
--- a/super-mbr.c
+++ b/super-mbr.c
@@ -193,7 +193,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int consistency_policy, int verbose)
 {
 	pr_err("mbr metadata cannot be used this way\n");
 	return 0;
diff --git a/super0.c b/super0.c
index f5b4507b..7a555e3b 100644
--- a/super0.c
+++ b/super0.c
@@ -725,7 +725,7 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
  * We use the first 8 bytes (64bits) of the sha1 of the host name
  */
 static int init_super0(struct supertype *st, mdu_array_info_t *info,
-		       unsigned long long size, char *ignored_name,
+		       struct shape *s, char *ignored_name,
 		       char *homehost, int *uuid,
 		       unsigned long long data_offset)
 {
@@ -764,8 +764,8 @@ static int init_super0(struct supertype *st, mdu_array_info_t *info,
 	sb->gvalid_words = 0; /* ignored */
 	sb->ctime = time(0);
 	sb->level = info->level;
-	sb->size = size;
-	if (size != (unsigned long long)sb->size)
+	sb->size = s->size;
+	if (s->size != (unsigned long long)sb->size)
 		return 0;
 	sb->nr_disks = info->nr_disks;
 	sb->raid_disks = info->raid_disks;
@@ -1267,7 +1267,7 @@ static int validate_geometry0(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int consistency_policy, int verbose)
 {
 	unsigned long long ldsize;
 	int fd;
diff --git a/super1.c b/super1.c
index f3520ac8..4a0f0416 100644
--- a/super1.c
+++ b/super1.c
@@ -1397,7 +1397,7 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 }
 
 static int init_super1(struct supertype *st, mdu_array_info_t *info,
-		       unsigned long long size, char *name, char *homehost,
+		       struct shape *s, char *name, char *homehost,
 		       int *uuid, unsigned long long data_offset)
 {
 	struct mdp_superblock_1 *sb;
@@ -1450,7 +1450,7 @@ static int init_super1(struct supertype *st, mdu_array_info_t *info,
 	sb->ctime = __cpu_to_le64((unsigned long long)time(0));
 	sb->level = __cpu_to_le32(info->level);
 	sb->layout = __cpu_to_le32(info->layout);
-	sb->size = __cpu_to_le64(size*2ULL);
+	sb->size = __cpu_to_le64(s->size*2ULL);
 	sb->chunksize = __cpu_to_le32(info->chunk_size>>9);
 	sb->raid_disks = __cpu_to_le32(info->raid_disks);
 
@@ -2487,7 +2487,7 @@ static int validate_geometry1(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int consistency_policy, int verbose)
 {
 	unsigned long long ldsize, devsize;
 	int bmspace;
diff --git a/sysfs.c b/sysfs.c
index b0657a04..53589a76 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -242,6 +242,17 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 	} else
 		sra->sysfs_array_state[0] = 0;
 
+	if (options & GET_CONSISTENCY_POLICY) {
+		strcpy(base, "consistency_policy");
+		if (load_sys(fname, buf, sizeof(buf))) {
+			sra->consistency_policy = CONSISTENCY_POLICY_UNKNOWN;
+		} else {
+			sra->consistency_policy = map_name(consistency_policies, buf);
+			if (sra->consistency_policy == UnSet)
+				sra->consistency_policy = CONSISTENCY_POLICY_UNKNOWN;
+		}
+	}
+
 	if (! (options & GET_DEVS))
 		return sra;
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 2/6] Detail: show consistency policy
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Show the currently enabled consistency policy in the output from
--detail. Add 3 spaces to all existing items in Detail output to align
with "Consistency Policy : ".

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Detail.c      | 90 +++++++++++++++++++++++++++++++++++------------------------
 super-ddf.c   |  6 ++--
 super-intel.c |  2 +-
 super0.c      |  4 +--
 super1.c      |  9 +++---
 5 files changed, 65 insertions(+), 46 deletions(-)

diff --git a/Detail.c b/Detail.c
index 3d928553..136875be 100644
--- a/Detail.c
+++ b/Detail.c
@@ -402,24 +402,25 @@ int Detail(char *dev, struct context *c)
 		printf("%s:\n", dev);
 
 		if (container)
-			printf("      Container : %s, member %s\n", container, member);
+			printf("         Container : %s, member %s\n", container,
+			       member);
 		else {
 		if (sra && sra->array.major_version < 0)
-			printf("        Version : %s\n", sra->text_version);
+			printf("           Version : %s\n", sra->text_version);
 		else
-			printf("        Version : %d.%d\n",
+			printf("           Version : %d.%d\n",
 			       array.major_version, array.minor_version);
 		}
 
 		atime = array.ctime;
 		if (atime)
-			printf("  Creation Time : %.24s\n", ctime(&atime));
+			printf("     Creation Time : %.24s\n", ctime(&atime));
 		if (array.raid_disks == 0 && external)
 			str = "container";
 		if (str)
-			printf("     Raid Level : %s\n", str);
+			printf("        Raid Level : %s\n", str);
 		if (larray_size)
-			printf("     Array Size : %llu%s\n", (larray_size>>10),
+			printf("        Array Size : %llu%s\n", (larray_size>>10),
 			       human_size(larray_size));
 		if (array.level >= 1) {
 			if (sra)
@@ -428,38 +429,38 @@ int Detail(char *dev, struct context *c)
 			    (larray_size >= 0xFFFFFFFFULL|| array.size == 0)) {
 				unsigned long long dsize = get_component_size(fd);
 				if (dsize > 0)
-					printf("  Used Dev Size : %llu%s\n",
+					printf("     Used Dev Size : %llu%s\n",
 					       dsize/2,
 					 human_size((long long)dsize<<9));
 				else
-					printf("  Used Dev Size : unknown\n");
+					printf("     Used Dev Size : unknown\n");
 			} else
-				printf("  Used Dev Size : %lu%s\n",
+				printf("     Used Dev Size : %lu%s\n",
 				       (unsigned long)array.size,
 				       human_size((unsigned long long)array.size<<10));
 		}
 		if (array.raid_disks)
-			printf("   Raid Devices : %d\n", array.raid_disks);
-		printf("  Total Devices : %d\n", array.nr_disks);
+			printf("      Raid Devices : %d\n", array.raid_disks);
+		printf("     Total Devices : %d\n", array.nr_disks);
 		if (!container &&
 		    ((sra == NULL && array.major_version == 0) ||
 		     (sra && sra->array.major_version == 0)))
-			printf("Preferred Minor : %d\n", array.md_minor);
+			printf("   Preferred Minor : %d\n", array.md_minor);
 		if (sra == NULL || sra->array.major_version >= 0)
-			printf("    Persistence : Superblock is %spersistent\n",
+			printf("       Persistence : Superblock is %spersistent\n",
 			       array.not_persistent?"not ":"");
 		printf("\n");
 		/* Only try GET_BITMAP_FILE for 0.90.01 and later */
 		if (vers >= 9001 &&
 		    ioctl(fd, GET_BITMAP_FILE, &bmf) == 0 &&
 		    bmf.pathname[0]) {
-			printf("  Intent Bitmap : %s\n", bmf.pathname);
+			printf("     Intent Bitmap : %s\n", bmf.pathname);
 			printf("\n");
 		} else if (array.state & (1<<MD_SB_BITMAP_PRESENT))
-			printf("  Intent Bitmap : Internal\n\n");
+			printf("     Intent Bitmap : Internal\n\n");
 		atime = array.utime;
 		if (atime)
-			printf("    Update Time : %.24s\n", ctime(&atime));
+			printf("       Update Time : %.24s\n", ctime(&atime));
 		if (array.raid_disks) {
 			static char *sync_action[] = {
 				", recovering",  ", resyncing",
@@ -473,7 +474,7 @@ int Detail(char *dev, struct context *c)
 			else
 				st = ", degraded";
 
-			printf("          State : %s%s%s%s%s%s \n",
+			printf("             State : %s%s%s%s%s%s \n",
 			       (array.state&(1<<MD_SB_CLEAN))?"clean":"active", st,
 			       (!e || (e->percent < 0 && e->percent != RESYNC_PENDING &&
 			       e->percent != RESYNC_DELAYED)) ? "" : sync_action[e->resync],
@@ -481,27 +482,27 @@ int Detail(char *dev, struct context *c)
 			       (e && e->percent == RESYNC_DELAYED) ? " (DELAYED)": "",
 			       (e && e->percent == RESYNC_PENDING) ? " (PENDING)": "");
 		} else if (inactive) {
-			printf("          State : inactive\n");
+			printf("             State : inactive\n");
 		}
 		if (array.raid_disks)
-			printf(" Active Devices : %d\n", array.active_disks);
+			printf("    Active Devices : %d\n", array.active_disks);
 		if (array.working_disks > 0)
-			printf("Working Devices : %d\n", array.working_disks);
+			printf("   Working Devices : %d\n", array.working_disks);
 		if (array.raid_disks) {
-			printf(" Failed Devices : %d\n", array.failed_disks);
-			printf("  Spare Devices : %d\n", array.spare_disks);
+			printf("    Failed Devices : %d\n", array.failed_disks);
+			printf("     Spare Devices : %d\n", array.spare_disks);
 		}
 		printf("\n");
 		if (array.level == 5) {
 			str = map_num(r5layout, array.layout);
-			printf("         Layout : %s\n", str?str:"-unknown-");
+			printf("            Layout : %s\n", str?str:"-unknown-");
 		}
 		if (array.level == 6) {
 			str = map_num(r6layout, array.layout);
-			printf("         Layout : %s\n", str?str:"-unknown-");
+			printf("            Layout : %s\n", str?str:"-unknown-");
 		}
 		if (array.level == 10) {
-			printf("         Layout :");
+			printf("            Layout :");
 			print_r10_layout(array.layout);
 			printf("\n");
 		}
@@ -512,20 +513,35 @@ int Detail(char *dev, struct context *c)
 		case 10:
 		case 6:
 			if (array.chunk_size)
-				printf("     Chunk Size : %dK\n\n",
+				printf("        Chunk Size : %dK\n\n",
 				       array.chunk_size/1024);
 			break;
 		case -1:
-			printf("       Rounding : %dK\n\n", array.chunk_size/1024);
+			printf("          Rounding : %dK\n\n",
+			       array.chunk_size/1024);
 			break;
 		default: break;
 		}
 
+		if (array.raid_disks) {
+			struct mdinfo *mdi = sysfs_read(fd, NULL,
+							GET_CONSISTENCY_POLICY);
+			if (mdi) {
+				char *policy = map_num(consistency_policies,
+						       mdi->consistency_policy);
+				sysfs_free(mdi);
+				if (policy)
+					printf("Consistency Policy : %s\n\n",
+					       policy);
+			}
+		}
+
 		if (e && e->percent >= 0) {
 			static char *sync_action[] = {
 				"Rebuild", "Resync",
 				"Reshape", "Check"};
-			printf(" %7s Status : %d%% complete\n", sync_action[e->resync], e->percent);
+			printf("    %7s Status : %d%% complete\n",
+			       sync_action[e->resync], e->percent);
 			is_rebuilding = 1;
 		}
 		free_mdstat(ms);
@@ -533,39 +549,41 @@ int Detail(char *dev, struct context *c)
 		if ((st && st->sb) && (info && info->reshape_active)) {
 #if 0
 This is pretty boring
-			printf("  Reshape pos'n : %llu%s\n", (unsigned long long) info->reshape_progress<<9,
+			printf("     Reshape pos'n : %llu%s\n",
+			       (unsigned long long) info->reshape_progress<<9,
 			       human_size((unsigned long long)info->reshape_progress<<9));
 #endif
 			if (info->delta_disks != 0)
-				printf("  Delta Devices : %d, (%d->%d)\n",
+				printf("     Delta Devices : %d, (%d->%d)\n",
 				       info->delta_disks,
 				       array.raid_disks - info->delta_disks,
 				       array.raid_disks);
 			if (info->new_level != array.level) {
 				str = map_num(pers, info->new_level);
-				printf("      New Level : %s\n", str?str:"-unknown-");
+				printf("         New Level : %s\n", str?str:"-unknown-");
 			}
 			if (info->new_level != array.level ||
 			    info->new_layout != array.layout) {
 				if (info->new_level == 5) {
 					str = map_num(r5layout, info->new_layout);
-					printf("     New Layout : %s\n",
+					printf("        New Layout : %s\n",
 					       str?str:"-unknown-");
 				}
 				if (info->new_level == 6) {
 					str = map_num(r6layout, info->new_layout);
-					printf("     New Layout : %s\n",
+					printf("        New Layout : %s\n",
 					       str?str:"-unknown-");
 				}
 				if (info->new_level == 10) {
-					printf("     New Layout : near=%d, %s=%d\n",
+					printf("        New Layout : near=%d, %s=%d\n",
 					       info->new_layout&255,
 					       (info->new_layout&0x10000)?"offset":"far",
 					       (info->new_layout>>8)&255);
 				}
 			}
 			if (info->new_chunk != array.chunk_size)
-				printf("  New Chunksize : %dK\n", info->new_chunk/1024);
+				printf("     New Chunksize : %dK\n",
+				       info->new_chunk/1024);
 			printf("\n");
 		} else if (e && e->percent >= 0)
 			printf("\n");
@@ -580,7 +598,7 @@ This is pretty boring
 			DIR *dir = opendir("/sys/block");
 			struct dirent *de;
 
-			printf("  Member Arrays :");
+			printf("     Member Arrays :");
 
 			while (dir && (de = readdir(dir)) != NULL) {
 				char path[287];
diff --git a/super-ddf.c b/super-ddf.c
index cdd16a47..c6037c1c 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -1742,10 +1742,10 @@ static void detail_super_ddf(struct supertype *st, char *homehost)
 	struct ddf_super *sb = st->sb;
 	int cnt = be16_to_cpu(sb->virt->populated_vdes);
 
-	printf(" Container GUID : "); print_guid(sb->anchor.guid, 1);
+	printf("    Container GUID : "); print_guid(sb->anchor.guid, 1);
 	printf("\n");
-	printf("            Seq : %08x\n", be32_to_cpu(sb->active->seq));
-	printf("  Virtual Disks : %d\n", cnt);
+	printf("               Seq : %08x\n", be32_to_cpu(sb->active->seq));
+	printf("     Virtual Disks : %d\n", cnt);
 	printf("\n");
 }
 #endif
diff --git a/super-intel.c b/super-intel.c
index 5d0f131f..2d92c8e3 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1987,7 +1987,7 @@ static void detail_super_imsm(struct supertype *st, char *homehost)
 
 	getinfo_super_imsm(st, &info, NULL);
 	fname_from_uuid(st, &info, nbuf, ':');
-	printf("\n           UUID : %s\n", nbuf + 5);
+	printf("\n              UUID : %s\n", nbuf + 5);
 }
 
 static void brief_detail_super_imsm(struct supertype *st)
diff --git a/super0.c b/super0.c
index 7a555e3b..10d9c40d 100644
--- a/super0.c
+++ b/super0.c
@@ -353,7 +353,7 @@ err:
 static void detail_super0(struct supertype *st, char *homehost)
 {
 	mdp_super_t *sb = st->sb;
-	printf("           UUID : ");
+	printf("              UUID : ");
 	if (sb->minor_version >= 90)
 		printf("%08x:%08x:%08x:%08x", sb->set_uuid0, sb->set_uuid1,
 		       sb->set_uuid2, sb->set_uuid3);
@@ -367,7 +367,7 @@ static void detail_super0(struct supertype *st, char *homehost)
 		if (memcmp(&sb->set_uuid2, hash, 8)==0)
 			printf(" (local to host %s)", homehost);
 	}
-	printf("\n         Events : %d.%d\n\n", sb->events_hi, sb->events_lo);
+	printf("\n            Events : %d.%d\n\n", sb->events_hi, sb->events_lo);
 }
 
 static void brief_detail_super0(struct supertype *st)
diff --git a/super1.c b/super1.c
index 4a0f0416..8df17a1c 100644
--- a/super1.c
+++ b/super1.c
@@ -780,19 +780,20 @@ static void detail_super1(struct supertype *st, char *homehost)
 	int i;
 	int l = homehost ? strlen(homehost) : 0;
 
-	printf("           Name : %.32s", sb->set_name);
+	printf("              Name : %.32s", sb->set_name);
 	if (l > 0 && l < 32 &&
 	    sb->set_name[l] == ':' &&
 	    strncmp(sb->set_name, homehost, l) == 0)
 		printf("  (local to host %s)", homehost);
 	if (bms->nodes > 0 && (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET))
-	    printf("\n   Cluster Name : %-64s", bms->cluster_name);
-	printf("\n           UUID : ");
+	    printf("\n      Cluster Name : %-64s", bms->cluster_name);
+	printf("\n              UUID : ");
 	for (i=0; i<16; i++) {
 		if ((i&3)==0 && i != 0) printf(":");
 		printf("%02x", sb->set_uuid[i]);
 	}
-	printf("\n         Events : %llu\n\n", (unsigned long long)__le64_to_cpu(sb->events));
+	printf("\n            Events : %llu\n\n",
+	       (unsigned long long)__le64_to_cpu(sb->events));
 }
 
 static void brief_detail_super1(struct supertype *st)
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 3/6] imsm: PPL support
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Enable creating and assembling IMSM raid5 arrays with PPL. Update the
IMSM metadata format to include new fields used for PPL.

Add structures for PPL metadata. They are used also by super1 and shared
with the kernel, so put them in md_p.h.

Write the initial empty PPL header when creating an array. When
assembling an array with PPL, validate the PPL header and in case it is
not correct allow to overwrite it if --force was provided.

Write the PPL location and size for a device to the new rdev sysfs
attributes 'ppl_sector' and 'ppl_size'. Enable PPL in the kernel by
writing to 'consistency_policy' before the array is activated.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |  49 +++++++++++
 Makefile      |   5 +-
 md_p.h        |  25 ++++++
 mdadm.h       |   6 ++
 super-intel.c | 274 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 sysfs.c       |  14 +++
 6 files changed, 349 insertions(+), 24 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 3da09033..8e55b49f 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1942,6 +1942,55 @@ int assemble_container_content(struct supertype *st, int mdfd,
 	map_update(NULL, fd2devnm(mdfd), content->text_version,
 		   content->uuid, chosen_name);
 
+	if (content->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    st->ss->validate_ppl) {
+		content->array.state |= 1;
+		err = 0;
+
+		for (dev = content->devs; dev; dev = dev->next) {
+			int dfd;
+			char *devpath;
+			int ret;
+
+			ret = st->ss->validate_ppl(st, content, dev);
+			if (ret == 0)
+				continue;
+
+			if (ret < 0) {
+				err = 1;
+				break;
+			}
+
+			if (!c->force) {
+				pr_err("%s contains invalid PPL - consider --force or --update-subarray with --update=no-ppl\n",
+					chosen_name);
+				content->array.state &= ~1;
+				avail[dev->disk.raid_disk] = 0;
+				break;
+			}
+
+			/* have --force - overwrite the invalid ppl */
+			devpath = map_dev(dev->disk.major, dev->disk.minor, 0);
+			dfd = dev_open(devpath, O_RDWR);
+			if (dfd < 0) {
+				pr_err("Failed to open %s\n", devpath);
+				err = 1;
+				break;
+			}
+
+			err = st->ss->write_init_ppl(st, content, dfd);
+			close(dfd);
+
+			if (err)
+				break;
+		}
+
+		if (err) {
+			free(avail);
+			return err;
+		}
+	}
+
 	if (enough(content->array.level, content->array.raid_disks,
 		   content->array.layout, content->array.state & 1, avail) == 0) {
 		if (c->export && result)
diff --git a/Makefile b/Makefile
index d1a6ac46..5ff6cc09 100644
--- a/Makefile
+++ b/Makefile
@@ -151,7 +151,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o util.o maps.o mdstat.o sysfs.o \
 	Kill.o sg_io.o dlink.o ReadMe.o super-intel.o \
 	super-mbr.o super-gpt.o \
 	super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \
-	platform-intel.o probe_roms.o
+	platform-intel.o probe_roms.o crc32c.o
 
 MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS))
 
@@ -161,7 +161,8 @@ STATICOBJS = pwgr.o
 ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c policy.c dlink.c util.c \
 	maps.c lib.c xmalloc.c \
 	super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
-	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c
+	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c \
+	crc32c.c
 ASSEMBLE_AUTO_SRCS := mdopen.c
 ASSEMBLE_FLAGS:= $(CFLAGS) -DMDASSEMBLE
 ifdef MDASSEMBLE_AUTO
diff --git a/md_p.h b/md_p.h
index dc9fec16..358a28ce 100644
--- a/md_p.h
+++ b/md_p.h
@@ -267,4 +267,29 @@ struct r5l_meta_block {
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
 
+struct ppl_header_entry {
+	__u64 data_sector;	/* raid sector of the new data */
+	__u32 pp_size;		/* length of partial parity */
+	__u32 data_size;	/* length of data */
+	__u32 parity_disk;	/* member disk containing parity */
+	__u32 checksum;		/* checksum of this entry's partial parity */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__u32) - sizeof(__u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8 reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
+	__u32 signature;		/* signature (family number of volume) */
+	__u32 padding;			/* zero pad */
+	__u64 generation;		/* generation number of the header */
+	__u32 entries_count;		/* number of entries in entry array */
+	__u32 checksum;			/* checksum of the header */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
diff --git a/mdadm.h b/mdadm.h
index b52d4d30..d222cc3b 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -300,6 +300,8 @@ struct mdinfo {
 		#define MaxSector  (~0ULL) /* resync/recovery complete position */
 	};
 	long			bitmap_offset;	/* 0 == none, 1 == a file */
+	unsigned int		ppl_size;
+	unsigned long long	ppl_sector;
 	unsigned long		safe_mode_delay; /* ms delay to mark clean */
 	int			new_level, delta_disks, new_layout, new_chunk;
 	int			errors;
@@ -1074,6 +1076,10 @@ extern struct superswitch {
 	/* write initial empty PPL on device */
 	int (*write_init_ppl)(struct supertype *st, struct mdinfo *info, int fd);
 
+	/* validate ppl before assemble */
+	int (*validate_ppl)(struct supertype *st, struct mdinfo *info,
+			    struct mdinfo *disk);
+
 	/* records new bad block in metadata */
 	int (*record_bad_block)(struct active_array *a, int n,
 					unsigned long long sector, int length);
diff --git a/super-intel.c b/super-intel.c
index 2d92c8e3..87fec8bc 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -102,6 +102,7 @@ struct imsm_disk {
 #define SPARE_DISK      __cpu_to_le32(0x01)  /* Spare */
 #define CONFIGURED_DISK __cpu_to_le32(0x02)  /* Member of some RaidDev */
 #define FAILED_DISK     __cpu_to_le32(0x04)  /* Permanent failure */
+#define JOURNAL_DISK    __cpu_to_le32(0x2000000) /* Device marked as Journaling Drive */
 	__u32 status;			 /* 0xF0 - 0xF3 */
 	__u32 owner_cfg_num; /* which config 0,1,2... owns this disk */
 	__u32 total_blocks_hi;		 /* 0xF4 - 0xF5 total blocks hi */
@@ -155,6 +156,9 @@ struct imsm_vol {
 #define MIGR_STATE_CHANGE 4
 #define MIGR_REPAIR 5
 	__u8  migr_type;	/* Initializing, Rebuilding, ... */
+#define RAIDVOL_CLEAN          0
+#define RAIDVOL_DIRTY          1
+#define RAIDVOL_DSRECORD_VALID 2
 	__u8  dirty;
 	__u8  fs_state;		/* fast-sync state for CnG (0xff == disabled) */
 	__u16 verify_errors;	/* number of mismatches */
@@ -190,7 +194,24 @@ struct imsm_dev {
 	__u16 cache_policy;
 	__u8  cng_state;
 	__u8  cng_sub_state;
-#define IMSM_DEV_FILLERS 10
+	__u16 my_vol_raid_dev_num; /* Used in Unique volume Id for this RaidDev */
+
+	/* NVM_EN */
+	__u8 nv_cache_mode;
+	__u8 nv_cache_flags;
+
+	/* Unique Volume Id of the NvCache Volume associated with this volume */
+	__u32 nvc_vol_orig_family_num;
+	__u16 nvc_vol_raid_dev_num;
+
+#define RWH_OFF 0
+#define RWH_DISTRIBUTED 1
+#define RWH_JOURNALING_DRIVE 2
+	__u8  rwh_policy; /* Raid Write Hole Policy */
+	__u8  jd_serial[MAX_RAID_SERIAL_LEN]; /* Journal Drive serial number */
+	__u8  filler1;
+
+#define IMSM_DEV_FILLERS 3
 	__u32 filler[IMSM_DEV_FILLERS];
 	struct imsm_vol vol;
 } __attribute__ ((packed));
@@ -257,6 +278,9 @@ static char *map_state_str[] = { "normal", "uninitialized", "degraded", "failed"
 #define UNIT_SRC_IN_CP_AREA 1   /* Source data for curr_migr_unit has
 				 *  already been migrated and must
 				 *  be recovered from checkpoint area */
+
+#define PPL_ENTRY_SPACE (128 * 1024) /* Size of the PPL, without the header */
+
 struct migr_record {
 	__u32 rec_status;	    /* Status used to determine how to restart
 				     * migration in case it aborts
@@ -1288,6 +1312,11 @@ static int is_failed(struct imsm_disk *disk)
 	return (disk->status & FAILED_DISK) == FAILED_DISK;
 }
 
+static int is_journal(struct imsm_disk *disk)
+{
+	return (disk->status & JOURNAL_DISK) == JOURNAL_DISK;
+}
+
 /* try to determine how much space is reserved for metadata from
  * the last get_extents() entry on the smallest active disk,
  * otherwise fallback to the default
@@ -1477,7 +1506,17 @@ static void print_imsm_dev(struct intel_super *super,
 				   blocks_per_migr_unit(super, dev));
 	}
 	printf("\n");
-	printf("    Dirty State : %s\n", dev->vol.dirty ? "dirty" : "clean");
+	printf("    Dirty State : %s\n", (dev->vol.dirty & RAIDVOL_DIRTY) ?
+					 "dirty" : "clean");
+	printf("     RWH Policy : ");
+	if (dev->rwh_policy == RWH_OFF)
+		printf("off\n");
+	else if (dev->rwh_policy == RWH_DISTRIBUTED)
+		printf("PPL distributed\n");
+	else if (dev->rwh_policy == RWH_JOURNALING_DRIVE)
+		printf("PPL journaling drive\n");
+	else
+		printf("<unknown:%d>\n", dev->rwh_policy);
 }
 
 static void print_imsm_disk(struct imsm_disk *disk,
@@ -1496,9 +1535,10 @@ static void print_imsm_disk(struct imsm_disk *disk,
 		printf("  Disk%02d Serial : %s\n", index, str);
 	else
 		printf("    Disk Serial : %s\n", str);
-	printf("          State :%s%s%s\n", is_spare(disk) ? " spare" : "",
-					    is_configured(disk) ? " active" : "",
-					    is_failed(disk) ? " failed" : "");
+	printf("          State :%s%s%s%s\n", is_spare(disk) ? " spare" : "",
+					      is_configured(disk) ? " active" : "",
+					      is_failed(disk) ? " failed" : "",
+					      is_journal(disk) ? " journal" : "");
 	printf("             Id : %08x\n", __le32_to_cpu(disk->scsi_id));
 	sz = total_blocks(disk) - reserved;
 	printf("    Usable Size : %llu%s\n",
@@ -3114,6 +3154,15 @@ static unsigned long long imsm_component_size_aligment_check(int level,
 	return component_size;
 }
 
+static unsigned long long get_ppl_sector(struct intel_super *super, int dev_idx)
+{
+	struct imsm_dev *dev = get_imsm_dev(super, dev_idx);
+	struct imsm_map *map = get_imsm_map(dev, MAP_0);
+
+	return pba_of_lba0(map) +
+	       (num_data_stripes(map) * map->blocks_per_strip);
+}
+
 static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info, char *dmap)
 {
 	struct intel_super *super = st->sb;
@@ -3140,7 +3189,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->array.utime	  = 0;
 	info->array.chunk_size	  =
 		__le16_to_cpu(map_to_analyse->blocks_per_strip) << 9;
-	info->array.state	  = !dev->vol.dirty;
+	info->array.state	  = !(dev->vol.dirty & RAIDVOL_DIRTY);
 	info->custom_array_size   = __le32_to_cpu(dev->size_high);
 	info->custom_array_size   <<= 32;
 	info->custom_array_size   |= __le32_to_cpu(dev->size_low);
@@ -3221,10 +3270,20 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	memset(info->uuid, 0, sizeof(info->uuid));
 	info->recovery_start = MaxSector;
 
+	if (info->array.level == 5 && dev->rwh_policy == RWH_DISTRIBUTED) {
+		info->consistency_policy = CONSISTENCY_POLICY_PPL;
+		info->ppl_sector = get_ppl_sector(super, super->current_vol);
+		info->ppl_size = (PPL_HEADER_SIZE + PPL_ENTRY_SPACE) >> 9;
+	} else if (info->array.level <= 0) {
+		info->consistency_policy = CONSISTENCY_POLICY_NONE;
+	} else {
+		info->consistency_policy = CONSISTENCY_POLICY_RESYNC;
+	}
+
 	info->reshape_progress = 0;
 	info->resync_start = MaxSector;
 	if ((map_to_analyse->map_state == IMSM_T_STATE_UNINITIALIZED ||
-	    dev->vol.dirty) &&
+	    !(info->array.state & 1)) &&
 	    imsm_reshape_blocks_arrays_changes(super) == 0) {
 		info->resync_start = 0;
 	}
@@ -3451,7 +3510,8 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 		 * found the 'most fresh' version of the metadata
 		 */
 		info->disk.state |= is_failed(disk) ? (1 << MD_DISK_FAULTY) : 0;
-		info->disk.state |= is_spare(disk) ? 0 : (1 << MD_DISK_SYNC);
+		info->disk.state |= (is_spare(disk) || is_journal(disk)) ?
+				    0 : (1 << MD_DISK_SYNC);
 	}
 
 	/* only call uuid_from_super_imsm when this disk is part of a populated container,
@@ -3906,7 +3966,7 @@ load_imsm_disk(int fd, struct intel_super *super, char *devname, int keep_fd)
 		 */
 		if (is_failed(&dl->disk))
 			dl->index = -2;
-		else if (is_spare(&dl->disk))
+		else if (is_spare(&dl->disk) || is_journal(&dl->disk))
 			dl->index = -1;
 	}
 
@@ -5303,6 +5363,20 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	}
 	mpb->num_raid_devs++;
 
+	if (s->consistency_policy == UnSet ||
+	    s->consistency_policy == CONSISTENCY_POLICY_RESYNC ||
+	    s->consistency_policy == CONSISTENCY_POLICY_NONE) {
+		dev->rwh_policy = RWH_OFF;
+	} else if (s->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		dev->rwh_policy = RWH_DISTRIBUTED;
+	} else {
+		free(dev);
+		free(dv);
+		pr_err("imsm does not support consistency policy %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		return 0;
+	}
+
 	dv->dev = dev;
 	dv->index = super->current_vol;
 	dv->next = super->devlist;
@@ -5927,11 +6001,146 @@ static int mgmt_disk(struct supertype *st)
 
 	return 0;
 }
+#endif
+
+__u32 crc32c_le(__u32 crc, unsigned char const *p, size_t len);
+
+static int write_init_ppl_imsm(struct supertype *st, struct mdinfo *info, int fd)
+{
+	struct intel_super *super = st->sb;
+	void *buf;
+	struct ppl_header *ppl_hdr;
+	int ret;
+
+	ret = posix_memalign(&buf, 4096, PPL_HEADER_SIZE);
+	if (ret) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return ret;
+	}
+
+	memset(buf, 0, PPL_HEADER_SIZE);
+	ppl_hdr = buf;
+	memset(ppl_hdr->reserved, 0xff, PPL_HDR_RESERVED);
+	ppl_hdr->signature = __cpu_to_le32(super->anchor->orig_family_num);
+	ppl_hdr->checksum = __cpu_to_le32(~crc32c_le(~0, buf, PPL_HEADER_SIZE));
+
+	if (lseek64(fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		ret = errno;
+		perror("Failed to seek to PPL header location");
+	}
+
+	if (!ret && write(fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		ret = errno;
+		perror("Write PPL header failed");
+	}
+
+	if (!ret)
+		fsync(fd);
+
+	free(buf);
+	return ret;
+}
+
+static int validate_ppl_imsm(struct supertype *st, struct mdinfo *info,
+			     struct mdinfo *disk)
+{
+	struct intel_super *super = st->sb;
+	struct dl *d;
+	void *buf;
+	int ret = 0;
+	struct ppl_header *ppl_hdr;
+	__u32 crc;
+	struct imsm_dev *dev;
+	struct imsm_map *map;
+	__u32 idx;
+
+	if (disk->disk.raid_disk < 0)
+		return 0;
+
+	if (posix_memalign(&buf, 4096, PPL_HEADER_SIZE)) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return -1;
+	}
+
+	dev = get_imsm_dev(super, info->container_member);
+	map = get_imsm_map(dev, MAP_X);
+	idx = get_imsm_disk_idx(dev, disk->disk.raid_disk, MAP_X);
+	d = get_imsm_dl_disk(super, idx);
+
+	if (!d || d->index < 0 || is_failed(&d->disk))
+		goto out;
+
+	if (lseek64(d->fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		perror("Failed to seek to PPL header location");
+		ret = -1;
+		goto out;
+	}
+
+	if (read(d->fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		perror("Read PPL header failed");
+		ret = -1;
+		goto out;
+	}
+
+	ppl_hdr = buf;
+
+	crc = __le32_to_cpu(ppl_hdr->checksum);
+	ppl_hdr->checksum = 0;
+
+	if (crc != ~crc32c_le(~0, buf, PPL_HEADER_SIZE)) {
+		dprintf("Wrong PPL header checksum on %s\n",
+			d->devname);
+		ret = 1;
+	}
+
+	if (!ret && (__le32_to_cpu(ppl_hdr->signature) !=
+		      super->anchor->orig_family_num)) {
+		dprintf("Wrong PPL header signature on %s\n",
+			d->devname);
+		ret = 1;
+	}
+
+out:
+	free(buf);
+
+	if (ret == 1 && map->map_state == IMSM_T_STATE_UNINITIALIZED)
+		return st->ss->write_init_ppl(st, info, d->fd);
+
+	return ret;
+}
+
+#ifndef MDASSEMBLE
+
+static int write_init_ppl_imsm_all(struct supertype *st, struct mdinfo *info)
+{
+	struct intel_super *super = st->sb;
+	struct dl *d;
+	int ret = 0;
+
+	if (info->consistency_policy != CONSISTENCY_POLICY_PPL ||
+	    info->array.level != 5)
+		return 0;
+
+	for (d = super->disks; d ; d = d->next) {
+		if (d->index < 0 || is_failed(&d->disk))
+			continue;
+
+		ret = st->ss->write_init_ppl(st, info, d->fd);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
 
 static int write_init_super_imsm(struct supertype *st)
 {
 	struct intel_super *super = st->sb;
 	int current_vol = super->current_vol;
+	int rv = 0;
+	struct mdinfo info;
+
+	getinfo_super_imsm(st, &info, NULL);
 
 	/* we are done with current_vol reset it to point st at the container */
 	super->current_vol = -1;
@@ -5939,24 +6148,29 @@ static int write_init_super_imsm(struct supertype *st)
 	if (st->update_tail) {
 		/* queue the recently created array / added disk
 		 * as a metadata update */
-		int rv;
 
 		/* determine if we are creating a volume or adding a disk */
 		if (current_vol < 0) {
 			/* in the mgmt (add/remove) disk case we are running
 			 * in mdmon context, so don't close fd's
 			 */
-			return mgmt_disk(st);
-		} else
-			rv = create_array(st, current_vol);
-
-		return rv;
+			rv = mgmt_disk(st);
+		} else {
+			rv = write_init_ppl_imsm_all(st, &info);
+			if (!rv)
+				rv = create_array(st, current_vol);
+		}
 	} else {
 		struct dl *d;
 		for (d = super->disks; d; d = d->next)
 			Kill(d->devname, NULL, 0, -1, 1);
-		return write_super_imsm(st, 1);
+		if (current_vol >= 0)
+			rv = write_init_ppl_imsm_all(st, &info);
+		if (!rv)
+			rv = write_super_imsm(st, 1);
 	}
+
+	return rv;
 }
 #endif
 
@@ -7375,7 +7589,8 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 			 *
 			 * FIXME handle dirty degraded
 			 */
-			if ((skip || recovery_start == 0) && !dev->vol.dirty)
+			if ((skip || recovery_start == 0) &&
+			    !(dev->vol.dirty & RAIDVOL_DIRTY))
 				this->resync_start = MaxSector;
 			if (skip)
 				continue;
@@ -7410,9 +7625,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 				info_d->component_size =
 						num_data_stripes(map) *
 						map->blocks_per_strip;
+				info_d->ppl_sector = this->ppl_sector;
+				info_d->ppl_size = this->ppl_size;
 			} else {
 				info_d->component_size = blocks_per_member(map);
 			}
+			info_d->consistency_policy = this->consistency_policy;
 
 			info_d->bb.supported = 1;
 			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
@@ -7928,12 +8146,16 @@ mark_checkpoint:
 
 skip_mark_checkpoint:
 	/* mark dirty / clean */
-	if (dev->vol.dirty != !consistent) {
+	if (((dev->vol.dirty & RAIDVOL_DIRTY) && consistent) ||
+	    (!(dev->vol.dirty & RAIDVOL_DIRTY) && !consistent)) {
 		dprintf("imsm: mark '%s'\n", consistent ? "clean" : "dirty");
-		if (consistent)
-			dev->vol.dirty = 0;
-		else
-			dev->vol.dirty = 1;
+		if (consistent) {
+			dev->vol.dirty = RAIDVOL_CLEAN;
+		} else {
+			dev->vol.dirty = RAIDVOL_DIRTY;
+			if (dev->rwh_policy == RWH_DISTRIBUTED)
+				dev->vol.dirty |= RAIDVOL_DSRECORD_VALID;
+		}
 		super->updates_pending++;
 	}
 
@@ -8445,6 +8667,11 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
 		di->component_size = a->info.component_size;
 		di->container_member = inst;
 		di->bb.supported = 1;
+		if (dev->rwh_policy == RWH_DISTRIBUTED) {
+			di->consistency_policy = CONSISTENCY_POLICY_PPL;
+			di->ppl_sector = get_ppl_sector(super, inst);
+			di->ppl_size = (PPL_HEADER_SIZE + PPL_ENTRY_SPACE) >> 9;
+		}
 		super->random = random32();
 		di->next = rv;
 		rv = di;
@@ -11600,6 +11827,9 @@ struct superswitch super_imsm = {
 	.container_content = container_content_imsm,
 	.validate_container = validate_container_imsm,
 
+	.write_init_ppl = write_init_ppl_imsm,
+	.validate_ppl	= validate_ppl_imsm,
+
 	.external	= 1,
 	.name = "imsm",
 
diff --git a/sysfs.c b/sysfs.c
index 53589a76..2a91ba0a 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -689,6 +689,16 @@ int sysfs_set_array(struct mdinfo *info, int vers)
 		 * once the reshape completes.
 		 */
 	}
+
+	if (info->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		if (sysfs_set_str(info, NULL, "consistency_policy",
+				  map_num(consistency_policies,
+					  info->consistency_policy))) {
+			pr_err("This kernel does not support PPL\n");
+			return 1;
+		}
+	}
+
 	return rv;
 }
 
@@ -720,6 +730,10 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 	rv = sysfs_set_num(sra, sd, "offset", sd->data_offset);
 	rv |= sysfs_set_num(sra, sd, "size", (sd->component_size+1) / 2);
 	if (sra->array.level != LEVEL_CONTAINER) {
+		if (sd->consistency_policy == CONSISTENCY_POLICY_PPL) {
+			rv |= sysfs_set_num(sra, sd, "ppl_sector", sd->ppl_sector);
+			rv |= sysfs_set_num(sra, sd, "ppl_size", sd->ppl_size);
+		}
 		if (sd->recovery_start == MaxSector)
 			/* This can correctly fail if array isn't started,
 			 * yet, so just ignore status for now.
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 4/6] super1: PPL support
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Enable creating and assembling raid5 arrays with PPL for 1.x metadata.

When creating, reserve enough space for PPL and store its size and
location in the superblock and set MD_FEATURE_PPL bit. Write an initial
empty header in the PPL area on each device. PPL is stored in the
metadata region reserved for internal write-intent bitmap, so don't
allow using bitmap and PPL together.

While at it, fix two endianness issues in write_empty_r5l_meta_block()
and write_init_super1().

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |   3 ++
 Create.c      |   2 +
 Grow.c        |  15 +++++-
 Incremental.c |   3 ++
 mdadm.h       |   1 +
 super1.c      | 150 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 155 insertions(+), 19 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 8e55b49f..c0984201 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -962,6 +962,9 @@ static int start_array(int mdfd,
 		c->readonly = 1;
 	}
 
+	if (content->consistency_policy == CONSISTENCY_POLICY_PPL)
+		clean = 1;
+
 	rv = set_array_info(mdfd, st, content);
 	if (rv && !err_ok) {
 		pr_err("failed to set array info for %s: %s\n",
diff --git a/Create.c b/Create.c
index 4080bf69..10e7d108 100644
--- a/Create.c
+++ b/Create.c
@@ -524,6 +524,8 @@ int Create(struct supertype *st, char *mddev,
 	if (!s->bitmap_file &&
 	    s->level >= 1 &&
 	    st->ss->add_internal_bitmap &&
+	    (s->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+	     s->consistency_policy != CONSISTENCY_POLICY_PPL) &&
 	    (s->write_behind || s->size > 100*1024*1024ULL)) {
 		if (c->verbose > 0)
 			pr_err("automatically enabling write-intent bitmap on large array\n");
diff --git a/Grow.c b/Grow.c
index e22661ca..a849012b 100755
--- a/Grow.c
+++ b/Grow.c
@@ -290,6 +290,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	int major = BITMAP_MAJOR_HI;
 	int vers = md_get_version(fd);
 	unsigned long long bitmapsize, array_size;
+	struct mdinfo *mdi;
 
 	if (vers < 9003) {
 		major = BITMAP_MAJOR_HOSTENDIAN;
@@ -389,12 +390,23 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 		free(st);
 		return 1;
 	}
+
+	mdi = sysfs_read(fd, NULL, GET_CONSISTENCY_POLICY);
+	if (mdi) {
+		if (mdi->consistency_policy == CONSISTENCY_POLICY_PPL) {
+			pr_err("Cannot add bitmap to array with PPL\n");
+			free(mdi);
+			free(st);
+			return 1;
+		}
+		free(mdi);
+	}
+
 	if (strcmp(s->bitmap_file, "internal") == 0 ||
 	    strcmp(s->bitmap_file, "clustered") == 0) {
 		int rv;
 		int d;
 		int offset_setable = 0;
-		struct mdinfo *mdi;
 		if (st->ss->add_internal_bitmap == NULL) {
 			pr_err("Internal bitmaps not supported with %s metadata\n", st->ss->name);
 			return 1;
@@ -446,6 +458,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			sysfs_init(mdi, fd, NULL);
 			rv = sysfs_set_num_signed(mdi, NULL, "bitmap/location",
 						  mdi->bitmap_offset);
+			free(mdi);
 		} else {
 			if (strcmp(s->bitmap_file, "clustered") == 0)
 				array.state |= (1 << MD_SB_CLUSTERED);
diff --git a/Incremental.c b/Incremental.c
index 0f507bb3..81afc7ec 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -528,6 +528,9 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 	journal_device_missing = (info.journal_device_required) && (info.journal_clean == 0);
 
+	if (info.consistency_policy == CONSISTENCY_POLICY_PPL)
+		info.array.state |= 1;
+
 	if (enough(info.array.level, info.array.raid_disks,
 		   info.array.layout, info.array.state & 1,
 		   avail) == 0) {
diff --git a/mdadm.h b/mdadm.h
index d222cc3b..2c7066dc 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -302,6 +302,7 @@ struct mdinfo {
 	long			bitmap_offset;	/* 0 == none, 1 == a file */
 	unsigned int		ppl_size;
 	unsigned long long	ppl_sector;
+	int			ppl_offset;
 	unsigned long		safe_mode_delay; /* ms delay to mark clean */
 	int			new_level, delta_disks, new_layout, new_chunk;
 	int			errors;
diff --git a/super1.c b/super1.c
index 8df17a1c..409b6c32 100644
--- a/super1.c
+++ b/super1.c
@@ -48,10 +48,18 @@ struct mdp_superblock_1 {
 
 	__u32	chunksize;	/* in 512byte sectors */
 	__u32	raid_disks;
-	__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful of feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__s16 offset; /* sectors from start of superblock that ppl starts */
+			__u16 size; /* ppl size in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__u32	new_level;	/* new level we are reshaping to		*/
@@ -131,6 +139,7 @@ struct misc_dev_info {
 #define	MD_FEATURE_NEW_OFFSET		64 /* new_offset must be honoured */
 #define	MD_FEATURE_BITMAP_VERSIONED	256 /* bitmap version number checked properly */
 #define	MD_FEATURE_JOURNAL		512 /* support write journal */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -140,6 +149,7 @@ struct misc_dev_info {
 					|MD_FEATURE_NEW_OFFSET		\
 					|MD_FEATURE_BITMAP_VERSIONED	\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 #ifndef MDASSEMBLE
@@ -289,6 +299,11 @@ static int awrite(struct align_fd *afd, void *buf, int len)
 	return len;
 }
 
+static inline unsigned int choose_ppl_space(int chunk)
+{
+	return (PPL_HEADER_SIZE >> 9) + (chunk > 128*2 ? chunk : 128*2);
+}
+
 #ifndef MDASSEMBLE
 static void examine_super1(struct supertype *st, char *homehost)
 {
@@ -392,6 +407,10 @@ static void examine_super1(struct supertype *st, char *homehost)
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 		printf("Internal Bitmap : %ld sectors from superblock\n",
 		       (long)(int32_t)__le32_to_cpu(sb->bitmap_offset));
+	} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+		printf("            PPL : %u sectors at offset %d sectors from superblock\n",
+		       __le16_to_cpu(sb->ppl.size),
+		       __le16_to_cpu(sb->ppl.offset));
 	}
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE)) {
 		printf("  Reshape pos'n : %llu%s\n", (unsigned long long)__le64_to_cpu(sb->reshape_position)/2,
@@ -934,10 +953,16 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	if (__le32_to_cpu(bsb->nodes) > 1)
 		info->array.state |= (1 << MD_SB_CLUSTERED);
 
+	super_offset = __le64_to_cpu(sb->super_offset);
 	info->data_offset = __le64_to_cpu(sb->data_offset);
 	info->component_size = __le64_to_cpu(sb->size);
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET))
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET)) {
 		info->bitmap_offset = (int32_t)__le32_to_cpu(sb->bitmap_offset);
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->ppl_offset = __le16_to_cpu(sb->ppl.offset);
+		info->ppl_size = __le16_to_cpu(sb->ppl.size);
+		info->ppl_sector = super_offset + info->ppl_offset;
+	}
 
 	info->disk.major = 0;
 	info->disk.minor = 0;
@@ -948,7 +973,6 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	else
 		role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
 
-	super_offset = __le64_to_cpu(sb->super_offset);
 	if (info->array.level <= 0)
 		data_size = __le64_to_cpu(sb->data_size);
 	else
@@ -965,8 +989,8 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 				end = bboffset;
 		}
 
-		if (super_offset + info->bitmap_offset < end)
-			end = super_offset + info->bitmap_offset;
+		if (super_offset + info->bitmap_offset + info->ppl_offset < end)
+			end = super_offset + info->bitmap_offset + info->ppl_offset;
 
 		if (info->data_offset + data_size < end)
 			info->space_after = end - data_size - info->data_offset;
@@ -982,6 +1006,11 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 			bmend += size;
 			if (bmend > earliest)
 				earliest = bmend;
+		} else if (info->ppl_offset > 0) {
+			unsigned long long pplend = info->ppl_offset +
+						    info->ppl_size;
+			if (pplend > earliest)
+				earliest = pplend;
 		}
 		if (sb->bblog_offset && sb->bblog_size) {
 			unsigned long long bbend = super_offset;
@@ -1075,8 +1104,20 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	}
 
 	info->array.working_disks = working;
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL))
+
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL)) {
 		info->journal_device_required = 1;
+		info->consistency_policy = CONSISTENCY_POLICY_JOURNAL;
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->consistency_policy = CONSISTENCY_POLICY_PPL;
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET)) {
+		info->consistency_policy = CONSISTENCY_POLICY_BITMAP;
+	} else if (info->array.level <= 0) {
+		info->consistency_policy = CONSISTENCY_POLICY_NONE;
+	} else {
+		info->consistency_policy = CONSISTENCY_POLICY_RESYNC;
+	}
+
 	info->journal_clean = 0;
 }
 
@@ -1239,6 +1280,9 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 			bitmap_offset = (long)__le32_to_cpu(sb->bitmap_offset);
 			bm_sectors = calc_bitmap_size(bms, 4096) >> 9;
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bitmap_offset = (long)__le16_to_cpu(sb->ppl.offset);
+			bm_sectors = (long)__le16_to_cpu(sb->ppl.size);
 		}
 #endif
 		if (sb_offset < data_offset) {
@@ -1472,6 +1516,9 @@ static int init_super1(struct supertype *st, mdu_array_info_t *info,
 
 	memset(sb->dev_roles, 0xff, MAX_SB_SIZE - sizeof(struct mdp_superblock_1));
 
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL)
+		sb->feature_map |= __cpu_to_le32(MD_FEATURE_PPL);
+
 	return 1;
 }
 
@@ -1645,10 +1692,49 @@ static unsigned long choose_bm_space(unsigned long devsize)
 
 static void free_super1(struct supertype *st);
 
-#define META_BLOCK_SIZE 4096
+#ifndef MDASSEMBLE
+
 __u32 crc32c_le(__u32 crc, unsigned char const *p, size_t len);
 
-#ifndef MDASSEMBLE
+static int write_init_ppl1(struct supertype *st, struct mdinfo *info, int fd)
+{
+	struct mdp_superblock_1 *sb = st->sb;
+	void *buf;
+	struct ppl_header *ppl_hdr;
+	int ret;
+
+	ret = posix_memalign(&buf, 4096, PPL_HEADER_SIZE);
+	if (ret) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return ret;
+	}
+
+	memset(buf, 0, PPL_HEADER_SIZE);
+	ppl_hdr = buf;
+	memset(ppl_hdr->reserved, 0xff, PPL_HDR_RESERVED);
+	ppl_hdr->signature = __cpu_to_le32(~crc32c_le(~0, sb->set_uuid,
+						      sizeof(sb->set_uuid)));
+	ppl_hdr->checksum = __cpu_to_le32(~crc32c_le(~0, buf, PPL_HEADER_SIZE));
+
+	if (lseek64(fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		ret = errno;
+		perror("Failed to seek to PPL header location");
+	}
+
+	if (!ret && write(fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		ret = errno;
+		perror("Write PPL header failed");
+	}
+
+	if (!ret)
+		fsync(fd);
+
+	free(buf);
+	return ret;
+}
+
+#define META_BLOCK_SIZE 4096
+
 static int write_empty_r5l_meta_block(struct supertype *st, int fd)
 {
 	struct r5l_meta_block *mb;
@@ -1675,7 +1761,7 @@ static int write_empty_r5l_meta_block(struct supertype *st, int fd)
 	crc = crc32c_le(crc, (void *)mb, META_BLOCK_SIZE);
 	mb->checksum = crc;
 
-	if (lseek64(fd, (sb->data_offset) * 512, 0) < 0LL) {
+	if (lseek64(fd, __le64_to_cpu(sb->data_offset) * 512, 0) < 0LL) {
 		pr_err("cannot seek to offset of the meta block\n");
 		goto fail_to_write;
 	}
@@ -1708,7 +1794,7 @@ static int write_init_super1(struct supertype *st)
 
 	for (di = st->info; di; di = di->next) {
 		if (di->disk.state & (1 << MD_DISK_JOURNAL))
-			sb->feature_map |= MD_FEATURE_JOURNAL;
+			sb->feature_map |= __cpu_to_le32(MD_FEATURE_JOURNAL);
 	}
 
 	for (di = st->info; di; di = di->next) {
@@ -1783,6 +1869,21 @@ static int write_init_super1(struct supertype *st)
 					(((char *)sb) + MAX_SB_SIZE);
 			bm_space = calc_bitmap_size(bms, 4096) >> 9;
 			bm_offset = (long)__le32_to_cpu(sb->bitmap_offset);
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bm_space = choose_ppl_space(__le32_to_cpu(sb->chunksize));
+			if (bm_space > UINT16_MAX)
+				bm_space = UINT16_MAX;
+			if (st->minor_version == 0) {
+				bm_offset = -bm_space - 8;
+				if (bm_offset < INT16_MIN) {
+					bm_offset = INT16_MIN;
+					bm_space = -bm_offset - 8;
+				}
+			} else {
+				bm_offset = 8;
+			}
+			sb->ppl.offset = __cpu_to_le16(bm_offset);
+			sb->ppl.size = __cpu_to_le16(bm_space);
 		} else {
 			bm_space = choose_bm_space(array_size);
 			bm_offset = 8;
@@ -1854,8 +1955,17 @@ static int write_init_super1(struct supertype *st)
 				goto error_out;
 		}
 
-		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
+		if (rv == 0 &&
+		    (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET)) {
 			rv = st->ss->write_bitmap(st, di->fd, NodeNumUpdate);
+		} else if (rv == 0 &&
+			 (__le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)) {
+			struct mdinfo info;
+
+			st->ss->getinfo_super(st, &info, NULL);
+			rv = st->ss->write_init_ppl(st, &info, di->fd);
+		}
+
 		close(di->fd);
 		di->fd = -1;
 		if (rv)
@@ -2123,11 +2233,13 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
 		return 0;
 
 #ifndef MDASSEMBLE
-	if (__le32_to_cpu(super->feature_map)&MD_FEATURE_BITMAP_OFFSET) {
+	if (__le32_to_cpu(super->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
 		/* hot-add. allow for actual size of bitmap */
 		struct bitmap_super_s *bsb;
 		bsb = (struct bitmap_super_s *)(((char*)super)+MAX_SB_SIZE);
 		bmspace = calc_bitmap_size(bsb, 4096) >> 9;
+	} else if (__le32_to_cpu(super->feature_map) & MD_FEATURE_PPL) {
+		bmspace = __le16_to_cpu(super->ppl.size);
 	}
 #endif
 	/* Allow space for bad block log */
@@ -2530,8 +2642,9 @@ static int validate_geometry1(struct supertype *st, int level,
 		return 0;
 	}
 
-	/* creating:  allow suitable space for bitmap */
-	bmspace = choose_bm_space(devsize);
+	/* creating:  allow suitable space for bitmap or PPL */
+	bmspace = consistency_policy == CONSISTENCY_POLICY_PPL ?
+		  choose_ppl_space((*chunk)*2) : choose_bm_space(devsize);
 
 	if (data_offset == INVALID_SECTORS)
 		data_offset = st->data_offset;
@@ -2566,7 +2679,7 @@ static int validate_geometry1(struct supertype *st, int level,
 	switch(st->minor_version) {
 	case 0: /* metadata at end.  Round down and subtract space to reserve */
 		devsize = (devsize & ~(4ULL*2-1));
-		/* space for metadata, bblog, bitmap */
+		/* space for metadata, bblog, bitmap/ppl */
 		devsize -= 8*2 + 8 + bmspace;
 		break;
 	case 1:
@@ -2642,6 +2755,7 @@ struct superswitch super1 = {
 	.add_to_super = add_to_super1,
 	.examine_badblocks = examine_badblocks_super1,
 	.copy_metadata = copy_metadata1,
+	.write_init_ppl = write_init_ppl1,
 #endif
 	.match_home = match_home1,
 	.uuid_from_super = uuid_from_super1,
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 5/6] Add 'ppl' and 'no-ppl' options for --update=
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

This can be used with --assemble for super1 and with --update-subarray
for imsm to enable or disable PPL in the metadata.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |  6 ++++++
 mdadm.8.in    | 27 ++++++++++++++++++++++++---
 mdadm.c       |  6 +++++-
 super-intel.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 super1.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index c0984201..6a6a56bf 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -602,6 +602,12 @@ static int load_devices(struct devs *devices, char *devmap,
 			if (strcmp(c->update, "uuid") == 0 && !ident->uuid_set)
 				random_uuid((__u8 *)ident->uuid);
 
+			if (strcmp(c->update, "ppl") == 0 &&
+			    ident->bitmap_fd >= 0) {
+				pr_err("PPL is not compatible with bitmap\n");
+				return -1;
+			}
+
 			dfd = dev_open(devname,
 				       tmpdev->disposition == 'I'
 				       ? O_RDWR : (O_RDWR|O_EXCL));
diff --git a/mdadm.8.in b/mdadm.8.in
index cad5db53..1178ed9b 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1176,6 +1176,8 @@ argument given to this flag can be one of
 .BR no\-bitmap ,
 .BR bbl ,
 .BR no\-bbl ,
+.BR ppl ,
+.BR no\-ppl ,
 .BR metadata ,
 or
 .BR super\-minor .
@@ -1316,6 +1318,16 @@ option will cause any reservation of space for a bad block list to be
 removed.  If the bad block list contains entries, this will fail, as
 removing the list could cause data corruption.
 
+The
+.B ppl
+option will enable PPL for a RAID5 array and reserve space for PPL on each
+device. There must be enough free space between the data and superblock and a
+write-intent bitmap or journal must not be used.
+
+The
+.B no\-ppl
+option will disable PPL in the superblock.
+
 .TP
 .BR \-\-freeze\-reshape
 Option is intended to be used in start-up scripts during initrd boot phase.
@@ -2327,9 +2339,11 @@ superblock field in the subarray.  Similar to updating an array in
 .B \-U
 or
 .B \-\-update=
-option.  Currently only
-.B name
-is supported.
+option. The supported options are
+.BR name ,
+.B ppl
+and
+.BR no\-ppl .
 
 The
 .B name
@@ -2340,6 +2354,13 @@ re\-assembled.  If updating
 would change the UUID of an active subarray this operation is blocked,
 and the command will end in an error.
 
+The
+.B ppl
+and
+.B no\-ppl
+options enable and disable PPL in the metadata. Currently supported only for
+IMSM subarrays.
+
 .TP
 .B \-\-examine
 The device should be a component of an md array.
diff --git a/mdadm.c b/mdadm.c
index d4e82866..6edf3abb 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -769,6 +769,10 @@ int main(int argc, char *argv[])
 				continue;
 			if (strcmp(c.update, "force-no-bbl") == 0)
 				continue;
+			if (strcmp(c.update, "ppl") == 0)
+				continue;
+			if (strcmp(c.update, "no-ppl") == 0)
+				continue;
 			if (strcmp(c.update, "metadata") == 0)
 				continue;
 			if (strcmp(c.update, "revert-reshape") == 0)
@@ -802,7 +806,7 @@ int main(int argc, char *argv[])
 		"     'sparc2.2', 'super-minor', 'uuid', 'name', 'nodes', 'resync',\n"
 		"     'summaries', 'homehost', 'home-cluster', 'byteorder', 'devicesize',\n"
 		"     'no-bitmap', 'metadata', 'revert-reshape'\n"
-		"     'bbl', 'no-bbl', 'force-no-bbl'\n"
+		"     'bbl', 'no-bbl', 'force-no-bbl', 'ppl', 'no-ppl'\n"
 				);
 			exit(outf == stdout ? 0 : 2);
 
diff --git a/super-intel.c b/super-intel.c
index 87fec8bc..785488a5 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -451,6 +451,7 @@ enum imsm_update_type {
 	update_general_migration_checkpoint,
 	update_size_change,
 	update_prealloc_badblocks_mem,
+	update_rwh_policy,
 };
 
 struct imsm_update_activate_spare {
@@ -543,6 +544,12 @@ struct imsm_update_prealloc_bb_mem {
 	enum imsm_update_type type;
 };
 
+struct imsm_update_rwh_policy {
+	enum imsm_update_type type;
+	int new_policy;
+	int dev_idx;
+};
+
 static const char *_sys_dev_type[] = {
 	[SYS_DEV_UNKNOWN] = "Unknown",
 	[SYS_DEV_SAS] = "SAS",
@@ -7373,6 +7380,34 @@ static int update_subarray_imsm(struct supertype *st, char *subarray,
 			}
 			super->updates_pending++;
 		}
+	} else if (strcmp(update, "ppl") == 0 ||
+		   strcmp(update, "no-ppl") == 0) {
+		int new_policy;
+		char *ep;
+		int vol = strtoul(subarray, &ep, 10);
+
+		if (*ep != '\0' || vol >= super->anchor->num_raid_devs)
+			return 2;
+
+		if (strcmp(update, "ppl") == 0)
+			new_policy = RWH_DISTRIBUTED;
+		else
+			new_policy = RWH_OFF;
+
+		if (st->update_tail) {
+			struct imsm_update_rwh_policy *u = xmalloc(sizeof(*u));
+
+			u->type = update_rwh_policy;
+			u->dev_idx = vol;
+			u->new_policy = new_policy;
+			append_metadata_update(st, u, sizeof(*u));
+		} else {
+			struct imsm_dev *dev;
+
+			dev = get_imsm_dev(super, vol);
+			dev->rwh_policy = new_policy;
+			super->updates_pending++;
+		}
 	} else
 		return 2;
 
@@ -9599,6 +9634,21 @@ static void imsm_process_update(struct supertype *st,
 	}
 	case update_prealloc_badblocks_mem:
 		break;
+	case update_rwh_policy: {
+		struct imsm_update_rwh_policy *u = (void *)update->buf;
+		int target = u->dev_idx;
+		struct imsm_dev *dev = get_imsm_dev(super, target);
+		if (!dev) {
+			dprintf("could not find subarray-%d\n", target);
+			break;
+		}
+
+		if (dev->rwh_policy != u->new_policy) {
+			dev->rwh_policy = u->new_policy;
+			super->updates_pending++;
+		}
+		break;
+	}
 	default:
 		pr_err("error: unsuported process update type:(type: %d)\n",	type);
 	}
@@ -9844,6 +9894,11 @@ static int imsm_prepare_update(struct supertype *st,
 		super->extra_space += sizeof(struct bbm_log) -
 			get_imsm_bbm_log_size(super->bbm_log);
 		break;
+	case update_rwh_policy: {
+		if (update->len < (int)sizeof(struct imsm_update_rwh_policy))
+			return 0;
+		break;
+	}
 	default:
 		return 0;
 	}
diff --git a/super1.c b/super1.c
index 409b6c32..e76f7777 100644
--- a/super1.c
+++ b/super1.c
@@ -1325,6 +1325,55 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		sb->bblog_size = 0;
 		sb->bblog_shift = 0;
 		sb->bblog_offset = 0;
+	} else if (strcmp(update, "ppl") == 0) {
+		unsigned long long sb_offset = __le64_to_cpu(sb->super_offset);
+		unsigned long long data_offset = __le64_to_cpu(sb->data_offset);
+		unsigned long long data_size = __le64_to_cpu(sb->data_size);
+		long bb_offset = __le32_to_cpu(sb->bblog_offset);
+		int space;
+		int optimal_space;
+		int offset;
+
+		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
+			pr_err("Cannot add PPL to array with bitmap\n");
+			return -2;
+		}
+
+		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_JOURNAL)) {
+			pr_err("Cannot add PPL to array with journal\n");
+			return -2;
+		}
+
+		if (sb_offset < data_offset) {
+			if (bb_offset)
+				space = bb_offset - 8;
+			else
+				space = data_offset - sb_offset - 8;
+			offset = 8;
+		} else {
+			offset = -(sb_offset - data_offset - data_size);
+			if (offset < INT16_MIN)
+				offset = INT16_MIN;
+			space = -(offset - bb_offset);
+		}
+
+		if (space < (PPL_HEADER_SIZE >> 9) + 8) {
+			pr_err("Not enough space to add ppl\n");
+			return -2;
+		}
+
+		optimal_space = choose_ppl_space(__le32_to_cpu(sb->chunksize));
+
+		if (space > optimal_space)
+			space = optimal_space;
+		if (space > UINT16_MAX)
+			space = UINT16_MAX;
+
+		sb->ppl.offset = __cpu_to_le16(offset);
+		sb->ppl.size = __cpu_to_le16(space);
+		sb->feature_map |= __cpu_to_le32(MD_FEATURE_PPL);
+	} else if (strcmp(update, "no-ppl") == 0) {
+		sb->feature_map &= ~ __cpu_to_le32(MD_FEATURE_PPL);
 	} else if (strcmp(update, "name") == 0) {
 		if (info->name[0] == 0)
 			sprintf(info->name, "%d", info->array.md_minor);
-- 
2.11.0


^ permalink raw reply related

* [PATCH v4 6/6] Grow: support consistency policy change
From: Artur Paszkiewicz @ 2017-03-29  9:54 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Extend the --consistency-policy parameter to work also in Grow mode.
Using it changes the currently active consistency policy in the kernel
driver and updates the metadata to make this change permanent. Currently
this supports only changing between "ppl" and "resync" policies, that is
enabling or disabling PPL at runtime.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Grow.c     | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ReadMe.c   |  46 +++++++++--------
 mdadm.8.in |  18 ++++++-
 mdadm.c    |   3 ++
 mdadm.h    |   2 +
 5 files changed, 218 insertions(+), 23 deletions(-)

diff --git a/Grow.c b/Grow.c
index a849012b..b86b53ec 100755
--- a/Grow.c
+++ b/Grow.c
@@ -528,6 +528,178 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	return 0;
 }
 
+int Grow_consistency_policy(char *devname, int fd, struct context *c, struct shape *s)
+{
+	struct supertype *st;
+	struct mdinfo *sra;
+	struct mdinfo *sd;
+	char *subarray = NULL;
+	int ret = 0;
+	char container_dev[PATH_MAX];
+
+	if (s->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+	    s->consistency_policy != CONSISTENCY_POLICY_PPL) {
+		pr_err("Operation not supported for consistency policy %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		return 1;
+	}
+
+	st = super_by_fd(fd, &subarray);
+	if (!st)
+		return 1;
+
+	sra = sysfs_read(fd, NULL, GET_CONSISTENCY_POLICY|GET_LEVEL|
+				   GET_DEVS|GET_STATE);
+	if (!sra) {
+		ret = 1;
+		goto free_st;
+	}
+
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    !st->ss->write_init_ppl) {
+		pr_err("%s metadata does not support PPL\n", st->ss->name);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (sra->array.level != 5) {
+		pr_err("Operation not supported for array level %d\n",
+				sra->array.level);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (sra->consistency_policy == (unsigned)s->consistency_policy) {
+		pr_err("Consistency policy is already %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		ret = 1;
+		goto free_info;
+	} else if (sra->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+		   sra->consistency_policy != CONSISTENCY_POLICY_PPL) {
+		pr_err("Current consistency policy is %s, cannot change to %s\n",
+		       map_num(consistency_policies, sra->consistency_policy),
+		       map_num(consistency_policies, s->consistency_policy));
+		ret = 1;
+		goto free_info;
+	}
+
+	if (subarray) {
+		char *update;
+
+		if (s->consistency_policy == CONSISTENCY_POLICY_PPL)
+			update = "ppl";
+		else
+			update = "no-ppl";
+
+		sprintf(container_dev, "/dev/%s", st->container_devnm);
+
+		ret = Update_subarray(container_dev, subarray, update, NULL,
+				      c->verbose);
+		if (ret)
+			goto free_info;
+	}
+
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		struct mdinfo info;
+
+		if (subarray) {
+			struct mdinfo *mdi;
+			int cfd;
+
+			cfd = open(container_dev, O_RDWR|O_EXCL);
+			if (cfd < 0) {
+				pr_err("Failed to open %s\n", container_dev);
+				ret = 1;
+				goto free_info;
+			}
+
+			ret = st->ss->load_container(st, cfd, st->container_devnm);
+			close(cfd);
+
+			if (ret) {
+				pr_err("Cannot read superblock for %s\n",
+				       container_dev);
+				goto free_info;
+			}
+
+			mdi = st->ss->container_content(st, subarray);
+			info = *mdi;
+			free(mdi);
+		}
+
+		for (sd = sra->devs; sd; sd = sd->next) {
+			int dfd;
+			char *devpath;
+
+			if ((sd->disk.state & (1 << MD_DISK_SYNC)) == 0)
+				continue;
+
+			devpath = map_dev(sd->disk.major, sd->disk.minor, 0);
+			dfd = dev_open(devpath, O_RDWR);
+			if (dfd < 0) {
+				pr_err("Failed to open %s\n", devpath);
+				ret = 1;
+				goto free_info;
+			}
+
+			if (!subarray) {
+				ret = st->ss->load_super(st, dfd, NULL);
+				if (ret) {
+					pr_err("Failed to load super-block.\n");
+					close(dfd);
+					goto free_info;
+				}
+
+				ret = st->ss->update_super(st, sra, "ppl", devname,
+							   c->verbose, 0, NULL);
+				if (ret) {
+					close(dfd);
+					st->ss->free_super(st);
+					goto free_info;
+				}
+				st->ss->getinfo_super(st, &info, NULL);
+			}
+
+			ret |= sysfs_set_num(sra, sd, "ppl_sector", info.ppl_sector);
+			ret |= sysfs_set_num(sra, sd, "ppl_size", info.ppl_size);
+
+			if (ret) {
+				pr_err("Failed to set PPL attributes for %s\n",
+				       sd->sys_name);
+				close(dfd);
+				st->ss->free_super(st);
+				goto free_info;
+			}
+
+			ret = st->ss->write_init_ppl(st, &info, dfd);
+			if (ret)
+				pr_err("Failed to write PPL\n");
+
+			close(dfd);
+
+			if (!subarray)
+				st->ss->free_super(st);
+
+			if (ret)
+				goto free_info;
+		}
+	}
+
+	ret = sysfs_set_str(sra, NULL, "consistency_policy",
+			    map_num(consistency_policies,
+				    s->consistency_policy));
+	if (ret)
+		pr_err("Failed to change array consistency policy\n");
+
+free_info:
+	sysfs_free(sra);
+free_st:
+	free(st);
+	free(subarray);
+
+	return ret;
+}
+
 /*
  * When reshaping an array we might need to backup some data.
  * This is written to all spares with a 'super_block' describing it.
diff --git a/ReadMe.c b/ReadMe.c
index fc04c2c3..eb8fb4b7 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -559,28 +559,30 @@ char Help_grow[] =
 "reconfiguration.\n"
 "\n"
 "Options that are valid with the grow (-G --grow) mode are:\n"
-"  --level=       -l   : Tell mdadm what level to convert the array to.\n"
-"  --layout=      -p   : For a FAULTY array, set/change the error mode.\n"
-"                      : for other arrays, update the layout\n"
-"  --size=        -z   : Change the active size of devices in an array.\n"
-"                      : This is useful if all devices have been replaced\n"
-"                      : with larger devices.   Value is in Kilobytes, or\n"
-"                      : the special word 'max' meaning 'as large as possible'.\n"
-"  --assume-clean      : When increasing the --size, this flag will avoid\n"
-"                      : a resync of the new space\n"
-"  --chunk=       -c   : Change the chunksize of the array\n"
-"  --raid-devices= -n  : Change the number of active devices in an array.\n"
-"  --add=         -a   : Add listed devices as part of reshape.  This is\n"
-"                      : needed for resizing a RAID0 which cannot have\n"
-"                      : spares already present.\n"
-"  --bitmap=      -b   : Add or remove a write-intent bitmap.\n"
-"  --backup-file= file : A file on a different device to store data for a\n"
-"                      : short time while increasing raid-devices on a\n"
-"                      : RAID4/5/6 array. Also needed throughout a reshape\n"
-"                      : when changing parameters other than raid-devices\n"
-"  --array-size=  -Z   : Change visible size of array.  This does not change\n"
-"                      : any data on the device, and is not stable across restarts.\n"
-"  --data-offset=      : Location on device to move start of data to.\n"
+"  --level=           -l : Tell mdadm what level to convert the array to.\n"
+"  --layout=          -p : For a FAULTY array, set/change the error mode.\n"
+"                        : for other arrays, update the layout\n"
+"  --size=            -z : Change the active size of devices in an array.\n"
+"                        : This is useful if all devices have been replaced\n"
+"                        : with larger devices.   Value is in Kilobytes, or\n"
+"                        : the special word 'max' meaning 'as large as possible'.\n"
+"  --assume-clean        : When increasing the --size, this flag will avoid\n"
+"                        : a resync of the new space\n"
+"  --chunk=           -c : Change the chunksize of the array\n"
+"  --raid-devices=    -n : Change the number of active devices in an array.\n"
+"  --add=             -a : Add listed devices as part of reshape.  This is\n"
+"                        : needed for resizing a RAID0 which cannot have\n"
+"                        : spares already present.\n"
+"  --bitmap=          -b : Add or remove a write-intent bitmap.\n"
+"  --backup-file= file   : A file on a different device to store data for a\n"
+"                        : short time while increasing raid-devices on a\n"
+"                        : RAID4/5/6 array. Also needed throughout a reshape\n"
+"                        : when changing parameters other than raid-devices\n"
+"  --array-size=      -Z : Change visible size of array. This does not change any\n"
+"                        : data on the device, and is not stable across restarts.\n"
+"  --data-offset=        : Location on device to move start of data to.\n"
+"  --consistency-policy= : Change the consistency policy of an active array.\n"
+"                     -k : Currently works only for PPL with RAID5.\n"
 ;
 
 char Help_incr[] =
diff --git a/mdadm.8.in b/mdadm.8.in
index 1178ed9b..744c12b5 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -126,7 +126,7 @@ of component devices and changing the number of active devices in
 Linear and RAID levels 0/1/4/5/6,
 changing the RAID level between 0, 1, 5, and 6, and between 0 and 10,
 changing the chunk size and layout for RAID 0,4,5,6,10 as well as adding or
-removing a write-intent bitmap.
+removing a write-intent bitmap and changing the array's consistency policy.
 
 .TP
 .B "Incremental Assembly"
@@ -1050,6 +1050,10 @@ after unclean shutdown. Implicitly selected when using
 For RAID5 only, Partial Parity Log is used to close the write hole and
 eliminate resync. PPL is stored in the metadata region of RAID member drives,
 no additional journal drive is needed.
+
+.PP
+Can be used with \-\-grow to change the consistency policy of an active array
+in some cases. See CONSISTENCY POLICY CHANGES below.
 .RE
 
 
@@ -2694,6 +2698,8 @@ RAID0, RAID4, and RAID5, and between RAID0 and RAID10 (in the near-2 mode).
 .IP \(bu 4
 add a write-intent bitmap to any array which supports these bitmaps, or
 remove a write-intent bitmap from such an array.
+.IP \(bu 4
+change the array's consistency policy.
 .PP
 
 Using GROW on containers is currently supported only for Intel's IMSM
@@ -2850,6 +2856,16 @@ can be added.  Note that if you add a bitmap stored in a file which is
 in a filesystem that is on the RAID array being affected, the system
 will deadlock.  The bitmap must be on a separate filesystem.
 
+.SS CONSISTENCY POLICY CHANGES
+
+The consistency policy of an active array can be changed by using the
+.B \-\-consistency\-policy
+option in Grow mode. Currently this works only for the
+.B ppl
+and
+.B resync
+policies and allows to enable or disable the RAID5 Partial Parity Log (PPL).
+
 .SH INCREMENTAL MODE
 
 .HP 12
diff --git a/mdadm.c b/mdadm.c
index 6edf3abb..5ebf1175 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1221,6 +1221,7 @@ int main(int argc, char *argv[])
 			s.journaldisks = 1;
 			continue;
 		case O(CREATE, 'k'):
+		case O(GROW, 'k'):
 			s.consistency_policy = map_name(consistency_policies,
 							optarg);
 			if (s.consistency_policy == UnSet ||
@@ -1679,6 +1680,8 @@ int main(int argc, char *argv[])
 			rv = Grow_reshape(devlist->devname, mdfd,
 					  devlist->next,
 					  data_offset, &c, &s);
+		} else if (s.consistency_policy != UnSet) {
+			rv = Grow_consistency_policy(devlist->devname, mdfd, &c, &s);
 		} else if (array_size == 0)
 			pr_err("no changes to --grow\n");
 		break;
diff --git a/mdadm.h b/mdadm.h
index 2c7066dc..4891acf5 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1331,6 +1331,8 @@ extern int Grow_restart(struct supertype *st, struct mdinfo *info,
 extern int Grow_continue(int mdfd, struct supertype *st,
 			 struct mdinfo *info, char *backup_file,
 			 int forked, int freeze_reshape);
+extern int Grow_consistency_policy(char *devname, int fd,
+				   struct context *c, struct shape *s);
 
 extern int restore_backup(struct supertype *st,
 			  struct mdinfo *content,
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Paolo Bonzini @ 2017-03-29 14:51 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490726988.2573.16.camel@sandisk.com>

On 28/03/2017 20:50, Bart Van Assche wrote:
> 
> This means that just like the start and end of a discard must be aligned on a
> discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
> also respect that granularity. I think this means that either
> __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
> REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
> modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.

I don't think this is the case.

Rather, Linux should try to align the WRITE SAME commands to the optimal
unmap granularity if the zeroed area requires performing more than one
WRITE SAME command (i.e. > maximum write same length or too large to fit
in the CDB).  However, even in that case it can use WRITE SAME with
UNMAP for the unaligned start and tail.  Unlike the UNMAP command, the
SCSI standard does guarantee that zeroes are written in the unaligned parts.

Paolo

^ permalink raw reply

* Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
From: Paolo Bonzini @ 2017-03-29 14:52 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490720411.2573.11.camel@sandisk.com>



On 28/03/2017 19:00, Bart Van Assche wrote:
> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
>> Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
>> kill this hack.
>>
>> [ ... ]
>>
>> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
>> index 2da04ce6aeef..dea212db9df3 100644
>> --- a/Documentation/ABI/testing/sysfs-block
>> +++ b/Documentation/ABI/testing/sysfs-block
>> @@ -213,14 +213,8 @@ What:		/sys/block/<disk>/queue/discard_zeroes_data
>>  Date:		May 2011
>>  Contact:	Martin K. Petersen <martin.petersen@oracle.com>
>>  Description:
>> -		Devices that support discard functionality may return
>> -		stale or random data when a previously discarded block
>> -		is read back. This can cause problems if the filesystem
>> -		expects discarded blocks to be explicitly cleared. If a
>> -		device reports that it deterministically returns zeroes
>> -		when a discarded area is read the discard_zeroes_data
>> -		parameter will be set to one. Otherwise it will be 0 and
>> -		the result of reading a discarded area is undefined.
>> +		Will always return 0.  Don't rely on any specific behavior
>> +		for discards, and don't read this file.
>>  
>>  What:		/sys/block/<disk>/queue/write_same_max_bytes
>>  Date:		January 2012
>>
>> [ ... ]
>>
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -208,7 +208,7 @@ static ssize_t queue_discard_max_store(struct request_queue *q,
>>  
>>  static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
>>  {
>> -	return queue_var_show(queue_discard_zeroes_data(q), page);
>> +	return 0;
>>  }
> 
> Hello Christoph,
> 
> It seems to me like the documentation in Documentation/ABI/testing/sysfs-block
> and the above code are not in sync. I think the above code will cause reading
> from the discard_zeroes_data attribute to return an empty string ("") instead
> of "0\n".
> 
> BTW, my personal preference is to remove this attribute entirely because keeping
> it will cause confusion, no matter how well we document the behavior of this
> attribute.

If you remove it, you should probably remove the BLKDISCARDZEROES ioctl too.

That said, the issue with discard_zeroes_data is that it is badly
defined; it was defined as "if I unmap X, will it read as zeroes?" but
this is not how the SCSI standard defines e.g. the UNMAP command with
LBPRZ=1.  But knowing something like LBPRZ ("if something is unmapped,
will it read as zeroes?") _would_ actually be useful for userspace.
This will be especially true once sd maps lseek(SEEK_HOLE/SEEK_DATA) to
the SCSI GET LBA STATUS command, or once dm-thin supports them.

Secondarily, if the former returns 1, userspace is also interested in
knowing "can REQ_OP_WRITE_ZEROES+REQ_UNMAP ever unmap anything?", i.e.
whether BLKDEV_ZERO_NOFALLBACK will ever return anything but
-EOPNOTSUPP.  For SCSI, this should intuitively mean whether LBPWS or
LBPWS10 are set, but the details depend on how the sd driver implements
REQ_OP_WRITE_ZEROES with REQ_UNMAP.

Paolo

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Paolo Bonzini @ 2017-03-29 14:57 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig, Jens Axboe, Martin K. Petersen,
	Alasdair G Kergon, shli@kernel.org, philipp.reisner@linbit.com,
	linux-block@vger.kernel.org, Linux SCSI List,
	drbd-dev@lists.linbit.com, dm, linux-raid@vger.kernel.org
In-Reply-To: <20170323225256.GK1138@soda.linbit>

On 23/03/2017 23:53, Lars Ellenberg wrote:
> Thin does not claim to zero data on discard.  which is ok, and correct,
> because it only punches holes on full chunks (or whatever you call
> them), and leaves the rest in the mapping tree as is.
> 
> And that behaviour would prevent DRBD from exposing discards if
> configured on top of thin. (see above)
> 
> But thin *could* easily guarantee zeroing, by simply punching holes
> where it can, and zeroing out the not fully-aligned partial start and
> end of the range.

That's the difference between REQ_OP_DISCARD (only punches holes on full
chunks) and REQ_OP_WRITE_ZEROES with the REQ_UNMAP flag (punches holes +
zeroes incomplete chunks).

dm-thinp's REQ_OP_DISCARD should not do anything for unaligned parts.
Instead, layers above should use REQ_OP_WRITE_ZEROES (with or without
REQ_UNMAP, as required) if they need zeroes.  dm-thinp would have to
split off the partial chunks, and zero them in the lower-level device
with REQ_OP_WRITE_ZEROES.

Paolo

^ permalink raw reply

* Re: [PATCH 12/23] sd: handle REQ_UNMAP
From: Paolo Bonzini @ 2017-03-29 14:57 UTC (permalink / raw)
  To: Bart Van Assche, Alasdair G Kergon, lars.ellenberg@linbit.com,
	Mike Snitzer, Christoph Hellwig, Martin K. Petersen,
	philipp.reisner@linbit.com, Jens Axboe, shli@kernel.org
  Cc: linux-block@vger.kernel.org, linux-raid@vger.kernel.org, dm,
	Linux SCSI List, drbd-dev@lists.linbit.com
In-Reply-To: <1490719722.2573.8.camel@sandisk.com>



On 28/03/2017 18:48, Bart Van Assche wrote:
>> +	if (rq->cmd_flags & REQ_UNMAP) {
>> +		switch (sdkp->provisioning_mode) {
>> +		case SD_LBP_WS16:
>> +			return sd_setup_write_same16_cmnd(cmd, true);
>> +		case SD_LBP_WS10:
>> +			return sd_setup_write_same10_cmnd(cmd, true);
>> +		}
>> +	}
>> +
>>  	if (sdp->no_write_same)
>>  		return BLKPREP_INVALID;
>>  	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
> Users can change the provisioning mode from user space from SD_LBP_WS16 into
> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
> 0xffffffff || nr_sectors > 0xffff) check if REQ_UNMAP is set.

Yeah, if REQ_UNMAP is set you should probably check sdkp->provisioning_mode
instead of sdkp->ws16, but apart from this it should still go through the
checks below.

Plus, if the provisioning mode is not ws10 or ws16, should
sd_setup_write_zeroes_cmnd:

1) do a WRITE SAME without UNMAP (what Christoph's code does)

2) return BLKPREP_INVALID

3) ignore provisioning mode and do a WRITE SAME with UNMAP

4) do a WRITE SAME without UNMAP for SD_LBP_{ZERO,FULL,DISABLE},
do a WRITE SAME with UNMAP for SD_LBP_{WS10,WS16,UNMAP}.

I'm in favor of (4).  The distinction between SD_LBP_UNMAP, SD_LBP_WS10
and SD_LBP_WS16 is as problematic as discard_zeroes_data in my opinion.

Thanks,

Paolo

^ permalink raw reply

* Re: [PATCH v4 0/6] mdadm support for Partial Parity Log
From: jes.sorensen @ 2017-03-29 15:39 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid
In-Reply-To: <20170329095420.17855-1-artur.paszkiewicz@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> This is the mdadm part of the PPL functionality. It adds a new parameter to
> mdadm to allow selecting the consistency policy for an array and implements PPL
> as one of the policies.
>
> All of this is currently targeted and tested with imsm and super1 metadata
> arrays. The kernel patches are available on Shaohua's md tree on the for-next
> branch.

Applied!

This time I got them all - I added one little patch on top which
reorganized the ppl elements in struct mdinfo for slightly better
packing.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 1/2] super1: replace hard-coded values with bit definitions
From: jes.sorensen @ 2017-03-29 15:41 UTC (permalink / raw)
  To: Gioh Kim; +Cc: neilb, linux-raid, linux-kernel
In-Reply-To: <1490780434-8720-1-git-send-email-gi-oh.kim@profitbricks.com>

Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> Some hard-coded values for disk status are replaced
> with bit definitions.
>
> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> ---
>  super1.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Applied!

Please use --cover-letter when you send out multi-commit patch sets,
which includes a short description of what the set includes. That way I
can just respond to the cover letter if all the patches are applied.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 2/2] mdadm.c: fix compile error "switch condition has boolean value"
From: jes.sorensen @ 2017-03-29 15:47 UTC (permalink / raw)
  To: Gioh Kim; +Cc: neilb, linux-raid, linux-kernel
In-Reply-To: <1490780434-8720-2-git-send-email-gi-oh.kim@profitbricks.com>

Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> Remove a boolean expression in switch condition
> to prevent compile error of some compilers.

Please be specific, which compile is unable to handle this?

> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> ---
>  mdadm.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mdadm.c b/mdadm.c
> index 08ddcab..a98a051 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1905,11 +1905,11 @@ static int misc_list(struct mddev_dev *devlist,
>  			rv |= SetAction(dv->devname, c->action);
>  			continue;
>  		}
> -		switch(dv->devname[0] == '/') {
> -			case 0:
> +		switch(dv->devname[0]) {
> +			default:
>  				mdfd = open_dev(dv->devname);
>  				if (mdfd >= 0) break;
> -			case 1:
> +			case '/':
>  				mdfd = open_mddev(dv->devname, 1);  
>  		}
>  		if (mdfd>=0) {

While I agree the original code is ugly, I am not convinced your
replacement is a lot prettier.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 2/2] mdadm.c: fix compile error "switch condition has boolean value"
From: Gioh Kim @ 2017-03-29 15:50 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel
In-Reply-To: <wrfjpoh0dqzz.fsf@gmail.com>

On Wed, Mar 29, 2017 at 11:47:28AM -0400, jes.sorensen@gmail.com wrote:
> Gioh Kim <gi-oh.kim@profitbricks.com> writes:
> > Remove a boolean expression in switch condition
> > to prevent compile error of some compilers.
> 
> Please be specific, which compile is unable to handle this?
> 
> > Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> > ---
> >  mdadm.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/mdadm.c b/mdadm.c
> > index 08ddcab..a98a051 100644
> > --- a/mdadm.c
> > +++ b/mdadm.c
> > @@ -1905,11 +1905,11 @@ static int misc_list(struct mddev_dev *devlist,
> >  			rv |= SetAction(dv->devname, c->action);
> >  			continue;
> >  		}
> > -		switch(dv->devname[0] == '/') {
> > -			case 0:
> > +		switch(dv->devname[0]) {
> > +			default:
> >  				mdfd = open_dev(dv->devname);
> >  				if (mdfd >= 0) break;
> > -			case 1:
> > +			case '/':
> >  				mdfd = open_mddev(dv->devname, 1);  
> >  		}
> >  		if (mdfd>=0) {
> 
> While I agree the original code is ugly, I am not convinced your
> replacement is a lot prettier.
> 

Yes, it's not elegant ;-)
Following is my compiler version and error message.

gohkim@ws00837:~$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
5.2.1-22ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-5 --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) 

gohkim@ws00837:~/work/tools/mdadm-mainstream$ make
cc -Wall -Werror -Wstrict-prototypes -Wextra -Wno-unused-parameter -ggdb
-DSendmail=\""/usr/sbin/sendmail -t"\" -DCONFFILE=\"/etc/mdadm.conf\"
-DCONFFILE2=\"/etc/mdadm/mdadm.conf\" -DMAP_DIR=\"/run/mdadm\"
-DMAP_FILE=\"map\" -DMDMON_DIR=\"/run/mdadm\"
-DFAILED_SLOTS_DIR=\"/run/mdadm/failed-slots\" -DNO_COROSYNC -DNO_DLM
-DVERSION=\"4.0-25-gddb13a5\" -DVERS_DATE="\"2017-03-29\"" -DUSE_PTHREADS
-DBINDIR=\"/sbin\"  -c -o mdadm.o mdadm.c
mdadm.c: In function ��misc_list��:
mdadm.c:1908:10: error: switch condition has boolean value
[-Werror=switch-bool]
   switch(dv->devname[0] == '/') {
             ^
             cc1: all warnings being treated as errors
             <builtin>: recipe for target 'mdadm.o' failed
             make: *** [mdadm.o] Error 1


-- 
Best regards,
Gi-Oh Kim
TEL: 0176 2697 8962

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Bart Van Assche @ 2017-03-29 16:28 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, pbonzini@redhat.com,
	shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <e19ca1bf-0908-1eb0-cb04-cd306692f216@redhat.com>

On Wed, 2017-03-29 at 16:51 +0200, Paolo Bonzini wrote:
> On 28/03/2017 20:50, Bart Van Assche wrote:
> > This means that just like the start and end of a discard must be aligned on a
> > discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
> > also respect that granularity. I think this means that either
> > __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
> > REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
> > modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.
> 
> I don't think this is the case.

Hello Paolo,

Can you cite the section(s) from the SCSI specs that support your view? I
reread the "5.49 WRITE SAME (10) command" and "4.7.3.4.4 WRITE SAME command
and unmap operations" sections but I have not found any explicit statement
that specifies the behavior for unaligned WRITE SAME commands with the UNMAP
bit set. It seems to me like the OPTIMAL UNMAP GRANULARITY parameter was
overlooked when both sections were written. Should we ask the T10 committee
for a clarification?

Another question is, if the specification of WRITE SAME + UNMAP would be
made unambiguous in the SBC document, whether or not we should take the risk
to trigger behavior that is not what we expect by sending unaligned WRITE
SAME + UNMAP commands to SCSI devices?

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Paolo Bonzini @ 2017-03-29 16:53 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490804856.3551.3.camel@sandisk.com>



On 29/03/2017 18:28, Bart Van Assche wrote:
> On Wed, 2017-03-29 at 16:51 +0200, Paolo Bonzini wrote:
>> On 28/03/2017 20:50, Bart Van Assche wrote:
>>> This means that just like the start and end of a discard must be aligned on a
>>> discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
>>> also respect that granularity. I think this means that either
>>> __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
>>> REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
>>> modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.
>>
>> I don't think this is the case.
> 
> Hello Paolo,
> 
> Can you cite the section(s) from the SCSI specs that support your view? I
> reread the "5.49 WRITE SAME (10) command" and "4.7.3.4.4 WRITE SAME command
> and unmap operations" sections but I have not found any explicit statement
> that specifies the behavior for unaligned WRITE SAME commands with the UNMAP
> bit set. It seems to me like the OPTIMAL UNMAP GRANULARITY parameter was
> overlooked when both sections were written. Should we ask the T10 committee
> for a clarification?

From 4.7.3.4.4:

------
If unmap operations are requested in a WRITE SAME command,
then for each specified LBA:

if the Data-Out Buffer of the WRITE SAME command is the same as the
logical block data returned by a read operation from that LBA while in
the unmapped state (see 4.7.4.5), then:

1) the device server performs the actions described in table 6; and

2) if an unmap operation is not performed in step 1), then the device
server shall perform the specified write operation to that LBA;
------

and from the description of WRITE SAME (10): "subsequent read operations
behave as if the device server wrote the single block of user data
received from the Data-Out Buffer to each logical block without
modification" (I have a slightly older copy though, it's 5.45 here).

It's pretty unambiguous that if the device cannot unmap (including the
case where the request is misaligned with respect to the granularity) it
does a write.

> Another question is, if the specification of WRITE SAME + UNMAP would be
> made unambiguous in the SBC document, whether or not we should take the risk
> to trigger behavior that is not what we expect by sending unaligned WRITE
> SAME + UNMAP commands to SCSI devices?

Yes, I think we should.

Paolo

^ permalink raw reply

* Grow set size issue
From: jes.sorensen @ 2017-03-29 17:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

In the below patch you changed the error handling, to make the kernel
not setting the size of the device being an error. However we still have
the code in place to handle the error, except it never triggers.

Question is do you remember the reason for this change? Old kernels not
allowing it, are there any legitimate reasons for the kernel to refuse
the size change?

Cheers,
Jes

commit b0a658ffbcd2104594e8a7a185fa0fe05127723e
Author: NeilBrown <neilb@suse.de>
Date:   Thu May 3 16:18:22 2012 +1000

    Grow: failing the set the per-device size is not an error.
    
    Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/Grow.c b/Grow.c
index 0b0d718..330e719 100644
--- a/Grow.c
+++ b/Grow.c
@@ -1668,7 +1668,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
                rv = 0;
                for (mdi = sra->devs; mdi; mdi = mdi->next) {
                        if (sysfs_set_num(sra, mdi, "size", size) < 0) {
-                               rv = 1;
+                               /* Probably kernel refusing to let us
+                                * reduce the size - not an error.
+                                */
                                break;
                        }
                        if (array.not_persistent == 0 &&

^ permalink raw reply related

* Re: Grow set size issue
From: NeilBrown @ 2017-03-29 21:34 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid
In-Reply-To: <wrfjlgrodlbj.fsf@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2898 bytes --]

On Thu, Mar 30 2017, jes.sorensen@gmail.com wrote:

> Hi Neil,
>
> In the below patch you changed the error handling, to make the kernel
> not setting the size of the device being an error. However we still have
> the code in place to handle the error, except it never triggers.

So we do.  I should have removed all of that.
I should have just reverted
Commit: 65a9798b58b4 ("FIX: Detect error and rollback metadata")

>
> Question is do you remember the reason for this change? Old kernels not
> allowing it, are there any legitimate reasons for the kernel to refuse
> the size change?

I needed to go further back to remind myself why we do these size change
at all.
The command being run here is "mdadm --grow /dev/mdX --size=foo"
which has a primary purpose of changing the component_size of the array.
What can happen is that someone makes all the components bigger
(E.g. with LVM) and then uses this command to set --size=max, and it
doesn't work.  That is because md doesn't know the devices are bigger.

You can tell md that devices have changes size by writing to the "size"
attribute.
mdadm doesn't have an option for doing that per-device - you need to
poke into sysfs.

To make it a bit easier, when you use "--grow --size=foo", mdadm will
always write "foo" to the "size" attribute of each device, just incase
that will be helpful. In the case where the device is now bigger, it
will.

In the case where the size of the array is being reduced, it is not
permitted to change the "size" of each device until the "component_size"
of the array has changed, so these attempts to change "size" will fail.
But that isn't a problem.

In short, the attempt to change "size" here is a convenience, and
optimization.  It doesn't matter if it fails.

So please just revert all bits of the above commit that are still
present.

Thanks!
NeilBrown

>
> Cheers,
> Jes
>
> commit b0a658ffbcd2104594e8a7a185fa0fe05127723e
> Author: NeilBrown <neilb@suse.de>
> Date:   Thu May 3 16:18:22 2012 +1000
>
>     Grow: failing the set the per-device size is not an error.
>     
>     Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/Grow.c b/Grow.c
> index 0b0d718..330e719 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -1668,7 +1668,9 @@ int Grow_reshape(char *devname, int fd, int quiet, char *backup_file,
>                 rv = 0;
>                 for (mdi = sra->devs; mdi; mdi = mdi->next) {
>                         if (sysfs_set_num(sra, mdi, "size", size) < 0) {
> -                               rv = 1;
> +                               /* Probably kernel refusing to let us
> +                                * reduce the size - not an error.
> +                                */
>                                 break;
>                         }
>                         if (array.not_persistent == 0 &&

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] mdadm.c: fix compile error "switch condition has boolean value"
From: NeilBrown @ 2017-03-29 21:38 UTC (permalink / raw)
  To: jes.sorensen, Gioh Kim; +Cc: linux-raid, linux-kernel, Wol
In-Reply-To: <wrfjpoh0dqzz.fsf@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

On Thu, Mar 30 2017, jes.sorensen@gmail.com wrote:

> Gioh Kim <gi-oh.kim@profitbricks.com> writes:
>> Remove a boolean expression in switch condition
>> to prevent compile error of some compilers.
>
> Please be specific, which compile is unable to handle this?
>
>> Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
>> ---
>>  mdadm.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/mdadm.c b/mdadm.c
>> index 08ddcab..a98a051 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -1905,11 +1905,11 @@ static int misc_list(struct mddev_dev *devlist,
>>  			rv |= SetAction(dv->devname, c->action);
>>  			continue;
>>  		}
>> -		switch(dv->devname[0] == '/') {
>> -			case 0:
>> +		switch(dv->devname[0]) {
>> +			default:
>>  				mdfd = open_dev(dv->devname);
>>  				if (mdfd >= 0) break;
>> -			case 1:
>> +			case '/':
>>  				mdfd = open_mddev(dv->devname, 1);  
>>  		}
>>  		if (mdfd>=0) {
>
> While I agree the original code is ugly, I am not convinced your
> replacement is a lot prettier.
>

Maybe

		if (dv->devname[0] == '/' ||
		    (mdfd = open_dev(dv->devname)) < 0)
				mdfd = open_mddev(dv->devname, 1);

??
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-03-30  0:49 UTC (permalink / raw)
  To: Brad Campbell, Linux-RAID; +Cc: Dan Williams
In-Reply-To: <5584b8b1-b4f1-0d2e-cbf3-f14b4127d031@fnarfbargle.com>

[-- Attachment #1: Type: text/plain, Size: 8266 bytes --]

On Wed, Mar 29 2017, Brad Campbell wrote:

> On 29/03/17 12:08, NeilBrown wrote:
>
>>
>> sdj is getting twice the utilization of the others but only about 10%
>> more rKB/sec.  That suggests lots of seeking.
>
> Yes, something is not entirely sequential.
>
>> Does "fuser /dev/sdj" report anything funny?
>
> No. No output. As far as I can tell nothing should be touching the disks 
> other than the md kernel thread.
>
>> Is there filesystem IO happening? If so, what filesystem?
>> Have you told the filesystem about the RAID layout?
>> Maybe the filesystem keeps reading some index blocks that are always on
>> the same drive.
>
> No. I probably wasn't as clear as I should have been in the initial 
> post. There was nothing mounted at the time.
>
> Right now the array contains one large LUKS container (dm-crypt). This 
> was mounted and a continuous dd done to the dm device to zero it out :
>
> 4111195+1 records in
> 4111195+1 records out
> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>
> So there is no filesystem on the drive.
>
> I failed and removed sdi :
>
> root@test:~# mdadm --fail /dev/md0 /dev/sdi
> mdadm: set /dev/sdi faulty in /dev/md0
> root@test:~# mdadm --remove /dev/md0 /dev/sdi
> mdadm: hot removed /dev/sdi from /dev/md0
> root@test:~# mdadm --zero-superblock /dev/sdi
> root@test:~# mdadm --add /dev/md0 /dev/sdi
> mdadm: added /dev/sdi
>
> Rebooted the machine to remove all tweaks of things like stripe cache 
> size, readahead, NCQ and anything else.
>
> I opened the LUKS container, dd'd a meg to the start to write to the 
> array and kick off the resync, then closed the LUKS container. At this 
> point dm should no longer be touching the drive and I've verified the 
> device has gone.
>
> I then ran sync a couple of times and waited a couple of minutes until I 
> was positive _nothing_ was touching md0, then ran :
>
> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>
>> So the problem moves from drive to drive?  Strongly suggests filesystem
>> metadata access to me.
>
> Again, sorry for me not being clear. The situation changes on a resync 
> specific basis. For example the reproduction I've done now I popped out 
> sdi rather than sdb, and now the bottleneck is sdg. It is the same if 
> the exact circumstances remain the same.

Now *that* is interesting!

In the first example you gave, device 0 was rebuilding and device 7 was
slow.
Now device 6 is rebuilding and device5 is slow.  Surely you see the
pattern?

Also, I previously observed that the slow device was getting 10% more
read traffic, but I was being imprecise.
In the first example  it was 16.2%, in the second it is 14.5.
These are close to 1/7.

There are 8 devices in the array, so there are 8 different layouts for
stripes (the parity block can be on 8 different devices).  8 is close to
7.

How recovery *should* work on RAID6 is that for each strip we read all
blocks except for the Q block (which isn't needed to rebuild a single
device) and the block which has failed (of course).  For the stripe
where the Q block has failed, we don't read the P block.  Just read all
the data and calculate Q from that.

So for every 8-device strip, we read from 6 devices.  5 Data plus parity
when data has failed, 6 data when P or Q has failed.
So across each of the 8 different strip layouts, there are 48 reads
Spread across 7 working devices....  using lower-case for blocks that
aren't read and assuming first device is being recovered:

           d  D  D  D  D  D  P  q
           d  D  D  D  D  P  q  D
           d  D  D  D  P  q  D  D
           d  D  D  P  q  D  D  D
           d  D  P  q  D  D  D  D
           d  P  q  D  D  D  D  D
           p  q  D  D  D  D  D  D
           q  D  D  D  D  D  D  p
totals:    0  7  7  7  7  7  7  6

The device just "before" the missing device gets fewer reads,
because we don't need to read P from that device.
But we are seeing that it gets more reads.  The numbers suggest that
it gets 8 reads (1/7 more than the others).  But the numbers also
suggest that it gets lots of seeks.  Maybe after all the other devices
have break read for one stripe, md/raid6 decides it does need those
blocks and goes back to read them?  That would explain the seeking
and the increased number of reads.

Time to look at the blktrace...
Looking at sdj, each request is Queued 8 times (I wonder why) but the
requests are completely sequential.

  blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'

This means the 'Q' block is being read.  Not what I expected, but not
too surprising.

Looking at sdg, which is the problem device:

  blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
    | sort | uniq -c | sort -n | tail -60

There are lots of places we skip forward 904 sectors.  That is an 8
sector read and a 896 sector seek. 896 sectors = 448K or 7 chunks
(64K chunk size).
These 7-chunk seeks are separated by 16 8-sector reads.  So it seems to
be reading all the P (or maybe Q) blocks from a sequence of stripes.

There are also lots of large back/forward seeks. They range from -31328
to -71280 backward and 27584 to 66408  (with a couple of smaller ones).

If we ignore all reads to sdg that are the result of seeking backwards -
i.e. that are earlier that the largest offset seen so far:

  blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8

we seek that (after an initial section) every block is being read.
So it seems that sdg is seeking backwards to read some blocks that it
has already read.

So what really happens in that all the working devices read all blocks
(although they don't really need Q), and sdg needs to seek backwards to
re-read 1/8 of all blocks, probably the P block.

So we should be seeing the rkB/s for the slow disk 1/8 more than the
others.  i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
very much.

Now to look at the code.

need_this_block() always returns 1 when s->syncing, so that is why
we already read the Q block.  The distinction between recovery and
resync isn't very strong in raid5/6.

So on the stripes where Q has failed, we read all the Data and P.
Then handle_parity_checks6() notice that there are enough devices that
something is worth checking, so it checks the P.  This is destructive!
of P.  The D blocks are xored into P and P is checked for all-zero.

I always get lost following this next bit....
I think that after deciding zero_sum_result is zero, check_state is set
to check_state_compute_result.
But before that is processed, we go around the loop again and notice
that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
so need_this_block() decided that we need to read it back in again.
We don't calculate it because we only calculate blocks on failed
devices (I'm sure we didn't used to do that).
So we read P back in ... just as we see from the trace.

Patch below makes the symptom go away in my testing, which confirms
that I'm on the right track.

Dan Williams changed the code to only recompute blocks from failed
devices, way back in 2008.
Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")

Back then we didn't have raid6 in the same code, so the fix was probably
fine.

I wonder what the best fix is.... Dan... Any thoughts?

NeilBrown

This works for simple tests but might not be correct.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c523fd69a7bc..2eb45d57226c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
 		BUG_ON(test_bit(R5_Wantread, &dev->flags));
 		BUG_ON(sh->batch_head);
 		if ((s->uptodate == disks - 1) &&
+		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
 		    (s->failed && (disk_idx == s->failed_num[0] ||
-				   disk_idx == s->failed_num[1]))) {
+				   disk_idx == s->failed_num[1])))) {
 			/* have disk failed, and we're requested to fetch it;
 			 * do compute it
 			 */

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: RAID6 rebuild oddity
From: Brad Campbell @ 2017-03-30  1:22 UTC (permalink / raw)
  To: NeilBrown, Linux-RAID; +Cc: Dan Williams
In-Reply-To: <87wpb7612w.fsf@notabene.neil.brown.name>

> NeilBrown
>
> This works for simple tests but might not be correct.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c523fd69a7bc..2eb45d57226c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3618,8 +3618,9 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>   		BUG_ON(test_bit(R5_Wantread, &dev->flags));
>   		BUG_ON(sh->batch_head);
>   		if ((s->uptodate == disks - 1) &&
> +		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
>   		    (s->failed && (disk_idx == s->failed_num[0] ||
> -				   disk_idx == s->failed_num[1]))) {
> +				   disk_idx == s->failed_num[1])))) {
>   			/* have disk failed, and we're requested to fetch it;
>   			 * do compute it
>   			 */

G'day Neil,

Thanks for the in-depth analysis. I managed to follow most of it and 
when I get some time I'll get my head into the code and see if I can 
*really* follow along.

As I'm not particularly fussed about the integrity of the system in its 
current state, I patched the kernel, re-booted and kicked off the resync 
again.


Before patch :

brad@test:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 
[8/7] [UUUUUU_U]
       [==============>......]  recovery = 72.0% (4222847296/5860391360) 
finish=541.6min speed=50382K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.32    0.00    7.02    1.37    0.00   90.29

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    19.00   10.20   22.40   114.40   139.00 
15.55     0.52   15.95   26.27   11.25   5.03  16.40
sdb           16514.60     0.00  141.00    0.00 67860.00     0.00 
962.55     2.64   18.91   18.91    0.00   3.21  45.20
sdc           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.06   14.77   14.77    0.00   2.79  39.20
sdd           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.64   18.92   18.92    0.00   3.04  42.80
sde           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.49   17.85   17.85    0.00   3.14  44.20
sdf           16514.60     0.00  140.60    0.00 67655.20     0.00 
962.38     2.24   16.05   16.05    0.00   2.86  40.20
sdg           18459.80     0.00  269.40    0.00 74772.00     0.00 
555.10    84.84  327.45  327.45    0.00   3.71 100.00
sdh               0.00    19.00    1.80   22.40    24.80   139.00 
13.54     0.25   10.17   10.00   10.18   5.87  14.20
sdi               0.00 16390.20    0.00  136.60     0.00 66138.40 
968.35     1.19    8.74    0.00    8.74   3.38  46.20
sdj           16514.40     0.00  140.80    0.00 67654.40     0.00 
961.00     2.25   16.11   16.11    0.00   2.86  40.20
md1               0.00     0.00    0.00    4.80     0.00 4.60     
1.92     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00   12.00   32.60   139.20   131.20 
12.13     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00

After Patch :

root@test:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid6 sdb[8] sdj[7] sdi[9] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
       35162348160 blocks super 1.2 level 6, 64k chunk, algorithm 2 
[8/7] [UUUUUU_U]
       [==============>......]  recovery = 73.3% (4297641980/5860391360) 
finish=284.7min speed=91465K/sec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    6.96    0.00    0.00   92.97

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.20     0.00 0.20     
2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb           28232.80     0.00  237.40    0.00 114502.40     0.00 
964.64     8.53   36.05   36.05    0.00   4.15  98.60
sdc           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.21    5.13    5.13    0.00   1.77  41.80
sdd           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.28    5.42    5.42    0.00   1.85  43.60
sde           28232.80     0.00  236.80    0.00 114195.20     0.00 
964.49     2.74   11.65   11.65    0.00   2.77  65.60
sdf           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.56    6.63    6.63    0.00   2.14  50.60
sdg           28232.80     0.00  234.80    0.00 113273.60     0.00 
964.85     3.06   12.86   12.86    0.00   2.82  66.20
sdh               0.00     0.00    0.00    0.20     0.00 0.20     
2.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00 28175.00    0.00  245.80     0.00 113683.20 
925.01     0.96    3.91    0.00    3.91   2.89  71.00
sdj           28232.80     0.00  236.00    0.00 113875.20     0.00 
965.04     1.53    6.50    6.50    0.00   2.07  48.80
md1               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00 0.00     
0.00     0.00    0.00    0.00    0.00   0.00   0.00

So that has significantly increased the rebuild speed, as you would expect.


Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox