Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* [PATCH 1/7] Makefile, LLVM: add -no-integrated-as to KBUILD_[AC]FLAGS
From: Michael Davidson @ 2017-03-17  0:15 UTC (permalink / raw)
  To: Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li
  Cc: Alexander Potapenko, Dmitry Vyukov, Matthias Kaehlcke, x86,
	linux-kbuild, linux-kernel, linux-crypto, linux-raid,
	Michael Davidson
In-Reply-To: <20170317001520.85223-1-md@google.com>

Add -no-integrated-as to KBUILD_AFLAGS and KBUILD_CFLAGS
for clang.

Signed-off-by: Michael Davidson <md@google.com>
---
 Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Makefile b/Makefile
index b841fb36beb2..b21fd0ca2946 100644
--- a/Makefile
+++ b/Makefile
@@ -704,6 +704,8 @@ KBUILD_CFLAGS += $(call cc-disable-warning, tautological-compare)
 # See modpost pattern 2
 KBUILD_CFLAGS += $(call cc-option, -mno-global-merge,)
 KBUILD_CFLAGS += $(call cc-option, -fcatch-undefined-behavior)
+KBUILD_CFLAGS += $(call cc-option, -no-integrated-as)
+KBUILD_AFLAGS += $(call cc-option, -no-integrated-as)
 else
 
 # These warnings generated too much noise in a regular build.
-- 
2.12.0.367.g23dc2f6d3c-goog

^ permalink raw reply related

* [PATCH 0/7] LLVM: make x86_64 kernel build with clang.
From: Michael Davidson @ 2017-03-17  0:15 UTC (permalink / raw)
  To: Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li
  Cc: Alexander Potapenko, Dmitry Vyukov, Matthias Kaehlcke, x86,
	linux-kbuild, linux-kernel, linux-crypto, linux-raid,
	Michael Davidson

This patch set is sufficient to get the x86_64 kernel to build
and boot correctly with clang-3.8 or greater.

The resulting build still has about 300 warnings, very few of
which appear to be significant. Most of them should be fixable
with some minor code refactoring although a few of them, such
as the complaints about implict conversions between enumerated
types may be candidates for just being disabled.

Michael Davidson (7):
  Makefile, LLVM: add -no-integrated-as to KBUILD_[AC]FLAGS
  Makefile, x86, LLVM: disable unsupported optimization flags
  x86, LLVM: suppress clang warnings about unaligned accesses
  x86, boot, LLVM: #undef memcpy etc in string.c
  x86, boot, LLVM: Use regparm=0 for memcpy and memset
  md/raid10, LLVM: get rid of variable length array
  crypto, x86, LLVM: aesni - fix token pasting

 Makefile                                |  4 ++++
 arch/x86/Makefile                       |  7 +++++++
 arch/x86/boot/copy.S                    | 15 +++++++++++++--
 arch/x86/boot/string.c                  |  9 +++++++++
 arch/x86/boot/string.h                  | 13 +++++++++++++
 arch/x86/crypto/aes_ctrby8_avx-x86_64.S |  7 ++-----
 drivers/md/raid10.c                     |  9 ++++-----
 7 files changed, 52 insertions(+), 12 deletions(-)

-- 
2.12.0.367.g23dc2f6d3c-goog


^ permalink raw reply

* recovery does not complete after --add
From: Eyal Lebedinsky @ 2017-03-16 23:51 UTC (permalink / raw)
  To: list linux-raid

This is a repost of the issue (from a month ago) that did not get a response then.

Executive summary:
After '--add'ing a new member a 'recovery' starts automatically but 'sync_max' is not reset
and the recovery hangs part way through where sync_max happened to be. This is a 7 disk raid6.

Is this a known issue? Was it fixed since? Did I do something wrong?

This machine runs the older f19.
     $ uname -a
Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

mdadm was built from source:
     $ sudo mdadm --version
mdadm - v4.0 - 2017-01-09

The long story:

I had a disk fail in a raid6. After some 'pending' sectors were logged I decided to do a 'check'
around that location by setting sync_min/max and echo 'check'. This is done with a script doing:
     # echo 4336657408 >sys/block/md127/md/sync_min
     # echo 4339803136 >sys/block/md127/md/sync_max
     # echo check      >sys/block/md127/md/sync_action
The messages then say
     Feb 18 13:46:31 e7 kernel: [  976.688691] md: data-check of RAID array md127
     Feb 18 13:46:31 e7 kernel: [  976.693254] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
     Feb 18 13:46:31 e7 kernel: [  976.699479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
     Feb 18 13:46:31 e7 kernel: [  976.709420] md: using 128k window, over a total of 3906885120k.
     Feb 18 13:46:31 e7 kernel: [  976.715457] md: resuming data-check of md127 from checkpoint.

Sure enough this elicited disk errors, but the disk did not recover and it was kicked out of the array.
Moreover it became unresponsive. It needed a power cycle so I shutdown and rebooted the machine.

messages:
     ... many i/o errors then sdf completely disappeared ... errors at sectors 4337414{000,040,168}
     Feb 18 13:47:08 e7 kernel: [ 1014.334781] md: super_written gets error=-5, uptodate=0
     Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Disk failure on sdf1, disabling device.
     Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Operation continuing on 6 devices.
     Feb 18 13:47:08 e7 kernel: [ 1014.417307] md: md127: data-check interrupted.

A second power off/on, a check produced the same result. At this point I added a fresh disk:
     $ sudo mdadm /dev/md127 --add /dev/sdj1
     $ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdj1[11] sdf1[7](F) sdi1[8] sde1[9] sdh1[12] sdc1[0] sdg1[13] sdd1[10]
       19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUU_UUU]
       [>....................]  recovery =  0.7% (29805572/3906885120) finish=509.2min speed=126880K/sec
       bitmap: 7/30 pages [28KB], 65536KB chunk

messages:
     Feb 18 14:23:10 e7 kernel: [ 3177.183250] md: bind<sdj1>
     Feb 18 14:23:10 e7 kernel: [ 3177.255529] md: recovery of RAID array md127
     Feb 18 14:23:10 e7 kernel: [ 3177.259894] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
     Feb 18 14:23:10 e7 kernel: [ 3177.265994] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
     Feb 18 14:23:10 e7 kernel: [ 3177.275736] md: using 128k window, over a total of 3906885120k.

However, the recovery stopped progressing at one point (my script logs /proc/mdstat every 10 seconds):
     2017-02-18 20:02:48        [===========>.........]  recovery = 55.4% (2166229192/3906885120) finish=372.8min speed=77803K/sec
     2017-02-18 20:02:58        [===========>.........]  recovery = 55.4% (2167083344/3906885120) finish=366.2min speed=79159K/sec
     2017-02-18 20:03:08        [===========>.........]  recovery = 55.4% (2167819876/3906885120) finish=374.8min speed=77316K/sec
     2017-02-18 20:03:18        [===========>.........]  recovery = 55.5% (2168520428/3906885120) finish=375.4min speed=77157K/sec
     2017-02-18 20:03:28        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=489.4min speed=59194K/sec
     2017-02-18 20:03:38        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=608.7min speed=47588K/sec
     2017-02-18 20:03:48        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=728.1min speed=39786K/sec
     2017-02-18 20:03:58        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=847.5min speed=34182K/sec
     ... no progress anymore
     2017-02-18 22:36:44        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110261.8min speed=262K/sec
     2017-02-18 22:36:54        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110381.2min speed=262K/sec
     2017-02-18 22:37:04        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110500.6min speed=262K/sec
     2017-02-18 22:37:14        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110619.9min speed=261K/sec

After some thinking I realised that it has paused at the point where the earlier 'check' failed. This was unexpected.
I followed with
     # echo 'max' >/sys/block/md127/md/sync_max
the recovery now moves on:
     2017-02-18 22:37:24        [===========>.........]  recovery = 55.5% (2168938500/3906885120) finish=117500.2min speed=246K/sec
     2017-02-18 22:37:34        [===========>.........]  recovery = 55.5% (2169997568/3906885120) finish=105201.7min speed=275K/sec
     2017-02-18 22:37:44        [===========>.........]  recovery = 55.5% (2171066120/3906885120) finish=90962.0min speed=318K/sec
     2017-02-18 22:37:54        [===========>.........]  recovery = 55.5% (2172125192/3906885120) finish=269.9min speed=107101K/sec
     2017-02-18 22:38:04        [===========>.........]  recovery = 55.6% (2173114372/3906885120) finish=272.1min speed=106165K/sec
     2017-02-18 22:38:14        [===========>.........]  recovery = 55.6% (2174004224/3906885120) finish=287.3min speed=100492K/sec

### and it completed over six hours later:
     Feb 19 04:49:16 e7 kernel: [55167.633100] md: md127: recovery done.

TIA

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)

^ permalink raw reply

* [PATCH v2] mdadm/r5cache: allow adding journal to array without journal
From: Song Liu @ 2017-03-16 23:24 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen,
	Song Liu

Currently, --add-journal can be only used to recreate broken journal
for arrays with journal since  creation. As the kernel code getting
more mature, this constraint is no longer necessary.

This patch allows --add-journal to add journal to array without
journal.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 Manage.c   | 9 ---------
 mdadm.8.in | 5 ++---
 2 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/Manage.c b/Manage.c
index 5c3d2b9..d038b68 100644
--- a/Manage.c
+++ b/Manage.c
@@ -946,7 +946,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 
 	/* only add journal to array that supports journaling */
 	if (dv->disposition == 'j') {
-		struct mdinfo mdi;
 		struct mdinfo *mdp;
 
 		mdp = sysfs_read(fd, NULL, GET_ARRAY_STATE);
@@ -960,14 +959,6 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 			pr_err("%s is not readonly, cannot add journal.\n", devname);
 			return -1;
 		}
-
-		sysfs_free(mdp);
-
-		tst->ss->getinfo_super(tst, &mdi, NULL);
-		if (mdi.journal_device_required == 0) {
-			pr_err("%s does not support journal device.\n", devname);
-			return -1;
-		}
 		disc.raid_disk = 0;
 	}
 
diff --git a/mdadm.8.in b/mdadm.8.in
index df1d460..9e5e896 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1474,9 +1474,8 @@ the device is found or <slot>:missing in case the device is not found.
 
 .TP
 .BR \-\-add-journal
-Recreate journal for RAID-4/5/6 array that lost a journal device. In the
-current implementation, this command cannot add a journal to an array
-that had a failed journal. To avoid interrupting on-going write opertions,
+Add journal to an existing array, or recreate journal for RAID-4/5/6 array
+that lost a journal device. To avoid interrupting on-going write opertions,
 .B \-\-add-journal
 only works for array in Read-Only state.
 
-- 
2.9.3


^ permalink raw reply related

* Re: Read data from disk that was part of RAID1 array
From: Reindl Harald @ 2017-03-16 22:31 UTC (permalink / raw)
  To: Peter Sangas, linux-raid
In-Reply-To: <008801d29ea3$3e858000$bb908000$@warren-selbert.com>



Am 16.03.2017 um 23:18 schrieb Peter Sangas:
> My setup:
>
> Linux green 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
> Ubuntu 16.04.2 LTS
>
> mdadm -V
> mdadm - v3.3 - 3rd September 2013
>
>
> I have only one disk from a RAID1 array and I would like to read data from
> one of the partitions.  This there a way to mount this and read the data?

you only need one disk to start a RAID1
RAID1 are ust mirrors

just assemble the RAID1 or most likely you can mount the partitions as 
they are

^ permalink raw reply

* Read data from disk that was part of RAID1 array
From: Peter Sangas @ 2017-03-16 22:18 UTC (permalink / raw)
  To: linux-raid

Hi,

My setup:

Linux green 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.2 LTS

mdadm -V
mdadm - v3.3 - 3rd September 2013

I have only one disk from a RAID1 array and I would like to read data from
one of the partitions.  This there a way to mount this and read the data?

Thanks,
Pete

^ permalink raw reply

* [PATCH v2 2/2] md/raid5: use consistency_policy to remove journal feature
From: Song Liu @ 2017-03-16 22:14 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen,
	Song Liu
In-Reply-To: <20170316221409.3283344-1-songliubraving@fb.com>

When journal device of an array fails, the array is forced into read-only
mode. To make the array normal without adding another journal device, we
need to remove journal _feature_ from the array.

This patch allows remove journal _feature_ from an array, For journal
existing journal should be either missing or faulty.

To remove journal feature, one can simply echo into the sysfs file:

 cat /sys/block/md0/md/consistency_policy
 journal

 echo resync > /sys/block/md0/md/consistency_policy
 cat /sys/block/md0/md/consistency_policy
 resync

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7a6e7ea..2302ec4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8369,11 +8369,19 @@ static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
 		if (!err)
 			raid5_reset_stripe_cache(mddev);
 		mddev_resume(mddev);
-	} else if (strncmp(buf, "resync", 6) == 0 && raid5_has_ppl(conf)) {
-		mddev_suspend(mddev);
-		log_exit(conf);
-		raid5_reset_stripe_cache(mddev);
-		mddev_resume(mddev);
+	} else if (strncmp(buf, "resync", 6) == 0) {
+		if (raid5_has_ppl(conf)) {
+			mddev_suspend(mddev);
+			log_exit(conf);
+			raid5_reset_stripe_cache(mddev);
+			mddev_resume(mddev);
+		} else if (test_bit(MD_HAS_JOURNAL, &conf->mddev->flags) &&
+			   r5l_log_disk_error(conf)) {
+			mddev_suspend(mddev);
+			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
+			mddev_resume(mddev);
+		} else
+			err = -EINVAL;
 	} else {
 		err = -EINVAL;
 	}
-- 
2.9.3


^ permalink raw reply related

* [PATCH v2 1/2] md/raid5: Add change_consistency_policy to raid4 and raid6
From: Song Liu @ 2017-03-16 22:14 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen,
	Song Liu

Add change_consistency_policy to raid6_personality and
raid4_personality.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 88cc898..7a6e7ea 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8408,6 +8408,7 @@ static struct md_personality raid6_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid6_takeover,
 	.congested	= raid5_congested,
+	.change_consistency_policy = raid5_change_consistency_policy,
 };
 static struct md_personality raid5_personality =
 {
@@ -8456,6 +8457,7 @@ static struct md_personality raid4_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid4_takeover,
 	.congested	= raid5_congested,
+	.change_consistency_policy = raid5_change_consistency_policy,
 };
 
 static int __init raid5_init(void)
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3 6/6] Grow: support consistency policy change
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

Extend the --consistency-policy parameter to work also in Grow mode.
Using it changes the currently active consistency policy in the kernel
driver and updates the metadata to make this change permanent. Currently
this supports only changing between "ppl" and "resync" policies, that is
enabling or disabling PPL at runtime.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Grow.c     | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mdadm.8.in |  18 ++++++-
 mdadm.c    |   3 ++
 mdadm.h    |   2 +
 4 files changed, 194 insertions(+), 1 deletion(-)

diff --git a/Grow.c b/Grow.c
index e4351d7f..c01d0945 100755
--- a/Grow.c
+++ b/Grow.c
@@ -528,6 +528,178 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	return 0;
 }
 
+int Grow_consistency_policy(char *devname, int fd, struct context *c, struct shape *s)
+{
+	struct supertype *st;
+	struct mdinfo *sra;
+	struct mdinfo *sd;
+	char *subarray = NULL;
+	int ret = 0;
+	char container_dev[PATH_MAX];
+
+	if (s->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+	    s->consistency_policy != CONSISTENCY_POLICY_PPL) {
+		pr_err("Operation not supported for consistency policy %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		return 1;
+	}
+
+	st = super_by_fd(fd, &subarray);
+	if (!st)
+		return 1;
+
+	sra = sysfs_read(fd, NULL, GET_CONSISTENCY_POLICY|GET_LEVEL|
+				   GET_DEVS|GET_STATE);
+	if (!sra) {
+		ret = 1;
+		goto free_st;
+	}
+
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    !st->ss->write_init_ppl) {
+		pr_err("%s metadata does not support PPL\n", st->ss->name);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (sra->array.level != 5) {
+		pr_err("Operation not supported for array level %d\n",
+				sra->array.level);
+		ret = 1;
+		goto free_info;
+	}
+
+	if (sra->consistency_policy == (unsigned)s->consistency_policy) {
+		pr_err("Consistency policy is already %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		ret = 1;
+		goto free_info;
+	} else if (sra->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+		   sra->consistency_policy != CONSISTENCY_POLICY_PPL) {
+		pr_err("Current consistency policy is %s, cannot change to %s\n",
+		       map_num(consistency_policies, sra->consistency_policy),
+		       map_num(consistency_policies, s->consistency_policy));
+		ret = 1;
+		goto free_info;
+	}
+
+	if (subarray) {
+		char *update;
+
+		if (s->consistency_policy == CONSISTENCY_POLICY_PPL)
+			update = "ppl";
+		else
+			update = "no-ppl";
+
+		sprintf(container_dev, "/dev/%s", st->container_devnm);
+
+		ret = Update_subarray(container_dev, subarray, update, NULL,
+				      c->verbose);
+		if (ret)
+			goto free_info;
+	}
+
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		struct mdinfo info;
+
+		if (subarray) {
+			struct mdinfo *mdi;
+			int cfd;
+
+			cfd = open(container_dev, O_RDWR|O_EXCL);
+			if (cfd < 0) {
+				pr_err("Failed to open %s\n", container_dev);
+				ret = 1;
+				goto free_info;
+			}
+
+			ret = st->ss->load_container(st, cfd, st->container_devnm);
+			close(cfd);
+
+			if (ret) {
+				pr_err("Cannot read superblock for %s\n",
+				       container_dev);
+				goto free_info;
+			}
+
+			mdi = st->ss->container_content(st, subarray);
+			info = *mdi;
+			free(mdi);
+		}
+
+		for (sd = sra->devs; sd; sd = sd->next) {
+			int dfd;
+			char *devpath;
+
+			if ((sd->disk.state & (1 << MD_DISK_SYNC)) == 0)
+				continue;
+
+			devpath = map_dev(sd->disk.major, sd->disk.minor, 0);
+			dfd = dev_open(devpath, O_RDWR);
+			if (dfd < 0) {
+				pr_err("Failed to open %s\n", devpath);
+				ret = 1;
+				goto free_info;
+			}
+
+			if (!subarray) {
+				ret = st->ss->load_super(st, dfd, NULL);
+				if (ret) {
+					pr_err("Failed to load super-block.\n");
+					close(dfd);
+					goto free_info;
+				}
+
+				ret = st->ss->update_super(st, sra, "ppl", devname,
+							   c->verbose, 0, NULL);
+				if (ret) {
+					close(dfd);
+					st->ss->free_super(st);
+					goto free_info;
+				}
+				st->ss->getinfo_super(st, &info, NULL);
+			}
+
+			ret |= sysfs_set_num(sra, sd, "ppl_sector", info.ppl_sector);
+			ret |= sysfs_set_num(sra, sd, "ppl_size", info.ppl_size);
+
+			if (ret) {
+				pr_err("Failed to set PPL attributes for %s\n",
+				       sd->sys_name);
+				close(dfd);
+				st->ss->free_super(st);
+				goto free_info;
+			}
+
+			ret = st->ss->write_init_ppl(st, &info, dfd);
+			if (ret)
+				pr_err("Failed to write PPL\n");
+
+			close(dfd);
+
+			if (!subarray)
+				st->ss->free_super(st);
+
+			if (ret)
+				goto free_info;
+		}
+	}
+
+	ret = sysfs_set_str(sra, NULL, "consistency_policy",
+			    map_num(consistency_policies,
+				    s->consistency_policy));
+	if (ret)
+		pr_err("Failed to change array consistency policy\n");
+
+free_info:
+	sysfs_free(sra);
+free_st:
+	free(st);
+	free(subarray);
+
+	return ret;
+}
+
 /*
  * When reshaping an array we might need to backup some data.
  * This is written to all spares with a 'super_block' describing it.
diff --git a/mdadm.8.in b/mdadm.8.in
index 1178ed9b..744c12b5 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -126,7 +126,7 @@ of component devices and changing the number of active devices in
 Linear and RAID levels 0/1/4/5/6,
 changing the RAID level between 0, 1, 5, and 6, and between 0 and 10,
 changing the chunk size and layout for RAID 0,4,5,6,10 as well as adding or
-removing a write-intent bitmap.
+removing a write-intent bitmap and changing the array's consistency policy.
 
 .TP
 .B "Incremental Assembly"
@@ -1050,6 +1050,10 @@ after unclean shutdown. Implicitly selected when using
 For RAID5 only, Partial Parity Log is used to close the write hole and
 eliminate resync. PPL is stored in the metadata region of RAID member drives,
 no additional journal drive is needed.
+
+.PP
+Can be used with \-\-grow to change the consistency policy of an active array
+in some cases. See CONSISTENCY POLICY CHANGES below.
 .RE
 
 
@@ -2694,6 +2698,8 @@ RAID0, RAID4, and RAID5, and between RAID0 and RAID10 (in the near-2 mode).
 .IP \(bu 4
 add a write-intent bitmap to any array which supports these bitmaps, or
 remove a write-intent bitmap from such an array.
+.IP \(bu 4
+change the array's consistency policy.
 .PP
 
 Using GROW on containers is currently supported only for Intel's IMSM
@@ -2850,6 +2856,16 @@ can be added.  Note that if you add a bitmap stored in a file which is
 in a filesystem that is on the RAID array being affected, the system
 will deadlock.  The bitmap must be on a separate filesystem.
 
+.SS CONSISTENCY POLICY CHANGES
+
+The consistency policy of an active array can be changed by using the
+.B \-\-consistency\-policy
+option in Grow mode. Currently this works only for the
+.B ppl
+and
+.B resync
+policies and allows to enable or disable the RAID5 Partial Parity Log (PPL).
+
 .SH INCREMENTAL MODE
 
 .HP 12
diff --git a/mdadm.c b/mdadm.c
index 3d0da1ec..0db4cb33 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1217,6 +1217,7 @@ int main(int argc, char *argv[])
 			s.journaldisks = 1;
 			continue;
 		case O(CREATE, 'k'):
+		case O(GROW, 'k'):
 			s.consistency_policy = map_name(consistency_policies,
 							optarg);
 			if (s.consistency_policy == UnSet ||
@@ -1675,6 +1676,8 @@ int main(int argc, char *argv[])
 			rv = Grow_reshape(devlist->devname, mdfd,
 					  devlist->next,
 					  data_offset, &c, &s);
+		} else if (s.consistency_policy != UnSet) {
+			rv = Grow_consistency_policy(devlist->devname, mdfd, &c, &s);
 		} else if (array_size == 0)
 			pr_err("no changes to --grow\n");
 		break;
diff --git a/mdadm.h b/mdadm.h
index ab1b7fc6..7173b258 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1331,6 +1331,8 @@ extern int Grow_restart(struct supertype *st, struct mdinfo *info,
 extern int Grow_continue(int mdfd, struct supertype *st,
 			 struct mdinfo *info, char *backup_file,
 			 int forked, int freeze_reshape);
+extern int Grow_consistency_policy(char *devname, int fd,
+				   struct context *c, struct shape *s);
 
 extern int restore_backup(struct supertype *st,
 			  struct mdinfo *content,
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 5/6] Add 'ppl' and 'no-ppl' options for --update=
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

This can be used with --assemble for super1 and with --update-subarray
for imsm to enable or disable PPL in the metadata.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |  6 ++++++
 mdadm.8.in    | 27 ++++++++++++++++++++++++---
 mdadm.c       |  6 +++++-
 super-intel.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 super1.c      | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index c0984201..6a6a56bf 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -602,6 +602,12 @@ static int load_devices(struct devs *devices, char *devmap,
 			if (strcmp(c->update, "uuid") == 0 && !ident->uuid_set)
 				random_uuid((__u8 *)ident->uuid);
 
+			if (strcmp(c->update, "ppl") == 0 &&
+			    ident->bitmap_fd >= 0) {
+				pr_err("PPL is not compatible with bitmap\n");
+				return -1;
+			}
+
 			dfd = dev_open(devname,
 				       tmpdev->disposition == 'I'
 				       ? O_RDWR : (O_RDWR|O_EXCL));
diff --git a/mdadm.8.in b/mdadm.8.in
index cad5db53..1178ed9b 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1176,6 +1176,8 @@ argument given to this flag can be one of
 .BR no\-bitmap ,
 .BR bbl ,
 .BR no\-bbl ,
+.BR ppl ,
+.BR no\-ppl ,
 .BR metadata ,
 or
 .BR super\-minor .
@@ -1316,6 +1318,16 @@ option will cause any reservation of space for a bad block list to be
 removed.  If the bad block list contains entries, this will fail, as
 removing the list could cause data corruption.
 
+The
+.B ppl
+option will enable PPL for a RAID5 array and reserve space for PPL on each
+device. There must be enough free space between the data and superblock and a
+write-intent bitmap or journal must not be used.
+
+The
+.B no\-ppl
+option will disable PPL in the superblock.
+
 .TP
 .BR \-\-freeze\-reshape
 Option is intended to be used in start-up scripts during initrd boot phase.
@@ -2327,9 +2339,11 @@ superblock field in the subarray.  Similar to updating an array in
 .B \-U
 or
 .B \-\-update=
-option.  Currently only
-.B name
-is supported.
+option. The supported options are
+.BR name ,
+.B ppl
+and
+.BR no\-ppl .
 
 The
 .B name
@@ -2340,6 +2354,13 @@ re\-assembled.  If updating
 would change the UUID of an active subarray this operation is blocked,
 and the command will end in an error.
 
+The
+.B ppl
+and
+.B no\-ppl
+options enable and disable PPL in the metadata. Currently supported only for
+IMSM subarrays.
+
 .TP
 .B \-\-examine
 The device should be a component of an md array.
diff --git a/mdadm.c b/mdadm.c
index 65431b76..3d0da1ec 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -769,6 +769,10 @@ int main(int argc, char *argv[])
 				continue;
 			if (strcmp(c.update, "force-no-bbl") == 0)
 				continue;
+			if (strcmp(c.update, "ppl") == 0)
+				continue;
+			if (strcmp(c.update, "no-ppl") == 0)
+				continue;
 			if (strcmp(c.update, "metadata") == 0)
 				continue;
 			if (strcmp(c.update, "revert-reshape") == 0)
@@ -802,7 +806,7 @@ int main(int argc, char *argv[])
 		"     'sparc2.2', 'super-minor', 'uuid', 'name', 'nodes', 'resync',\n"
 		"     'summaries', 'homehost', 'home-cluster', 'byteorder', 'devicesize',\n"
 		"     'no-bitmap', 'metadata', 'revert-reshape'\n"
-		"     'bbl', 'no-bbl', 'force-no-bbl'\n"
+		"     'bbl', 'no-bbl', 'force-no-bbl', 'ppl', 'no-ppl'\n"
 				);
 			exit(outf == stdout ? 0 : 2);
 
diff --git a/super-intel.c b/super-intel.c
index ad3a4536..53fab8a3 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -451,6 +451,7 @@ enum imsm_update_type {
 	update_general_migration_checkpoint,
 	update_size_change,
 	update_prealloc_badblocks_mem,
+	update_rwh_policy,
 };
 
 struct imsm_update_activate_spare {
@@ -543,6 +544,12 @@ struct imsm_update_prealloc_bb_mem {
 	enum imsm_update_type type;
 };
 
+struct imsm_update_rwh_policy {
+	enum imsm_update_type type;
+	int new_policy;
+	int dev_idx;
+};
+
 static const char *_sys_dev_type[] = {
 	[SYS_DEV_UNKNOWN] = "Unknown",
 	[SYS_DEV_SAS] = "SAS",
@@ -7370,6 +7377,34 @@ static int update_subarray_imsm(struct supertype *st, char *subarray,
 			}
 			super->updates_pending++;
 		}
+	} else if (strcmp(update, "ppl") == 0 ||
+		   strcmp(update, "no-ppl") == 0) {
+		int new_policy;
+		char *ep;
+		int vol = strtoul(subarray, &ep, 10);
+
+		if (*ep != '\0' || vol >= super->anchor->num_raid_devs)
+			return 2;
+
+		if (strcmp(update, "ppl") == 0)
+			new_policy = RWH_DISTRIBUTED;
+		else
+			new_policy = RWH_OFF;
+
+		if (st->update_tail) {
+			struct imsm_update_rwh_policy *u = xmalloc(sizeof(*u));
+
+			u->type = update_rwh_policy;
+			u->dev_idx = vol;
+			u->new_policy = new_policy;
+			append_metadata_update(st, u, sizeof(*u));
+		} else {
+			struct imsm_dev *dev;
+
+			dev = get_imsm_dev(super, vol);
+			dev->rwh_policy = new_policy;
+			super->updates_pending++;
+		}
 	} else
 		return 2;
 
@@ -9596,6 +9631,21 @@ static void imsm_process_update(struct supertype *st,
 	}
 	case update_prealloc_badblocks_mem:
 		break;
+	case update_rwh_policy: {
+		struct imsm_update_rwh_policy *u = (void *)update->buf;
+		int target = u->dev_idx;
+		struct imsm_dev *dev = get_imsm_dev(super, target);
+		if (!dev) {
+			dprintf("could not find subarray-%d\n", target);
+			break;
+		}
+
+		if (dev->rwh_policy != u->new_policy) {
+			dev->rwh_policy = u->new_policy;
+			super->updates_pending++;
+		}
+		break;
+	}
 	default:
 		pr_err("error: unsuported process update type:(type: %d)\n",	type);
 	}
@@ -9841,6 +9891,11 @@ static int imsm_prepare_update(struct supertype *st,
 		super->extra_space += sizeof(struct bbm_log) -
 			get_imsm_bbm_log_size(super->bbm_log);
 		break;
+	case update_rwh_policy: {
+		if (update->len < (int)sizeof(struct imsm_update_rwh_policy))
+			return 0;
+		break;
+	}
 	default:
 		return 0;
 	}
diff --git a/super1.c b/super1.c
index 76eeca11..541f31ee 100644
--- a/super1.c
+++ b/super1.c
@@ -1325,6 +1325,55 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		sb->bblog_size = 0;
 		sb->bblog_shift = 0;
 		sb->bblog_offset = 0;
+	} else if (strcmp(update, "ppl") == 0) {
+		unsigned long long sb_offset = __le64_to_cpu(sb->super_offset);
+		unsigned long long data_offset = __le64_to_cpu(sb->data_offset);
+		unsigned long long data_size = __le64_to_cpu(sb->data_size);
+		long bb_offset = __le32_to_cpu(sb->bblog_offset);
+		int space;
+		int optimal_space;
+		int offset;
+
+		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
+			pr_err("Cannot add PPL to array with bitmap\n");
+			return -2;
+		}
+
+		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_JOURNAL)) {
+			pr_err("Cannot add PPL to array with journal\n");
+			return -2;
+		}
+
+		if (sb_offset < data_offset) {
+			if (bb_offset)
+				space = bb_offset - 8;
+			else
+				space = data_offset - sb_offset - 8;
+			offset = 8;
+		} else {
+			offset = -(sb_offset - data_offset - data_size);
+			if (offset < INT16_MIN)
+				offset = INT16_MIN;
+			space = -(offset - bb_offset);
+		}
+
+		if (space < (PPL_HEADER_SIZE >> 9) + 8) {
+			pr_err("Not enough space to add ppl\n");
+			return -2;
+		}
+
+		optimal_space = choose_ppl_space(__le32_to_cpu(sb->chunksize));
+
+		if (space > optimal_space)
+			space = optimal_space;
+		if (space > UINT16_MAX)
+			space = UINT16_MAX;
+
+		sb->ppl.offset = __cpu_to_le16(offset);
+		sb->ppl.size = __cpu_to_le16(space);
+		sb->feature_map |= __cpu_to_le32(MD_FEATURE_PPL);
+	} else if (strcmp(update, "no-ppl") == 0) {
+		sb->feature_map &= ~ __cpu_to_le32(MD_FEATURE_PPL);
 	} else if (strcmp(update, "name") == 0) {
 		if (info->name[0] == 0)
 			sprintf(info->name, "%d", info->array.md_minor);
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 4/6] super1: PPL support
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

Enable creating and assembling raid5 arrays with PPL for 1.x metadata.

When creating, reserve enough space for PPL and store its size and
location in the superblock and set MD_FEATURE_PPL bit. Write an initial
empty header in the PPL area on each device. PPL is stored in the
metadata region reserved for internal write-intent bitmap, so don't
allow using bitmap and PPL together.

While at it, fix two endianness issues in write_empty_r5l_meta_block()
and write_init_super1().

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |   3 ++
 Create.c      |   2 +
 Grow.c        |  15 +++++-
 Incremental.c |   3 ++
 mdadm.h       |   1 +
 super1.c      | 150 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 155 insertions(+), 19 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 8e55b49f..c0984201 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -962,6 +962,9 @@ static int start_array(int mdfd,
 		c->readonly = 1;
 	}
 
+	if (content->consistency_policy == CONSISTENCY_POLICY_PPL)
+		clean = 1;
+
 	rv = set_array_info(mdfd, st, content);
 	if (rv && !err_ok) {
 		pr_err("failed to set array info for %s: %s\n",
diff --git a/Create.c b/Create.c
index 4080bf69..10e7d108 100644
--- a/Create.c
+++ b/Create.c
@@ -524,6 +524,8 @@ int Create(struct supertype *st, char *mddev,
 	if (!s->bitmap_file &&
 	    s->level >= 1 &&
 	    st->ss->add_internal_bitmap &&
+	    (s->consistency_policy != CONSISTENCY_POLICY_RESYNC &&
+	     s->consistency_policy != CONSISTENCY_POLICY_PPL) &&
 	    (s->write_behind || s->size > 100*1024*1024ULL)) {
 		if (c->verbose > 0)
 			pr_err("automatically enabling write-intent bitmap on large array\n");
diff --git a/Grow.c b/Grow.c
index 455c5f90..e4351d7f 100755
--- a/Grow.c
+++ b/Grow.c
@@ -290,6 +290,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 	int major = BITMAP_MAJOR_HI;
 	int vers = md_get_version(fd);
 	unsigned long long bitmapsize, array_size;
+	struct mdinfo *mdi;
 
 	if (vers < 9003) {
 		major = BITMAP_MAJOR_HOSTENDIAN;
@@ -389,12 +390,23 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 		free(st);
 		return 1;
 	}
+
+	mdi = sysfs_read(fd, NULL, GET_CONSISTENCY_POLICY);
+	if (mdi) {
+		if (mdi->consistency_policy == CONSISTENCY_POLICY_PPL) {
+			pr_err("Cannot add bitmap to array with PPL\n");
+			free(mdi);
+			free(st);
+			return 1;
+		}
+		free(mdi);
+	}
+
 	if (strcmp(s->bitmap_file, "internal") == 0 ||
 	    strcmp(s->bitmap_file, "clustered") == 0) {
 		int rv;
 		int d;
 		int offset_setable = 0;
-		struct mdinfo *mdi;
 		if (st->ss->add_internal_bitmap == NULL) {
 			pr_err("Internal bitmaps not supported with %s metadata\n", st->ss->name);
 			return 1;
@@ -446,6 +458,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			sysfs_init(mdi, fd, NULL);
 			rv = sysfs_set_num_signed(mdi, NULL, "bitmap/location",
 						  mdi->bitmap_offset);
+			free(mdi);
 		} else {
 			if (strcmp(s->bitmap_file, "clustered") == 0)
 				array.state |= (1 << MD_SB_CLUSTERED);
diff --git a/Incremental.c b/Incremental.c
index 0f507bb3..81afc7ec 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -528,6 +528,9 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 	journal_device_missing = (info.journal_device_required) && (info.journal_clean == 0);
 
+	if (info.consistency_policy == CONSISTENCY_POLICY_PPL)
+		info.array.state |= 1;
+
 	if (enough(info.array.level, info.array.raid_disks,
 		   info.array.layout, info.array.state & 1,
 		   avail) == 0) {
diff --git a/mdadm.h b/mdadm.h
index 10c20416..ab1b7fc6 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -302,6 +302,7 @@ struct mdinfo {
 	long			bitmap_offset;	/* 0 == none, 1 == a file */
 	unsigned int		ppl_size;
 	unsigned long long	ppl_sector;
+	int			ppl_offset;
 	unsigned long		safe_mode_delay; /* ms delay to mark clean */
 	int			new_level, delta_disks, new_layout, new_chunk;
 	int			errors;
diff --git a/super1.c b/super1.c
index 672cdde6..76eeca11 100644
--- a/super1.c
+++ b/super1.c
@@ -48,10 +48,18 @@ struct mdp_superblock_1 {
 
 	__u32	chunksize;	/* in 512byte sectors */
 	__u32	raid_disks;
-	__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__u32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful of feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__s16 offset; /* sectors from start of superblock that ppl starts */
+			__u16 size; /* ppl size in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__u32	new_level;	/* new level we are reshaping to		*/
@@ -131,6 +139,7 @@ struct misc_dev_info {
 #define	MD_FEATURE_NEW_OFFSET		64 /* new_offset must be honoured */
 #define	MD_FEATURE_BITMAP_VERSIONED	256 /* bitmap version number checked properly */
 #define	MD_FEATURE_JOURNAL		512 /* support write journal */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -140,6 +149,7 @@ struct misc_dev_info {
 					|MD_FEATURE_NEW_OFFSET		\
 					|MD_FEATURE_BITMAP_VERSIONED	\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 #ifndef MDASSEMBLE
@@ -289,6 +299,11 @@ static int awrite(struct align_fd *afd, void *buf, int len)
 	return len;
 }
 
+static inline unsigned int choose_ppl_space(int chunk)
+{
+	return (PPL_HEADER_SIZE >> 9) + (chunk > 128*2 ? chunk : 128*2);
+}
+
 #ifndef MDASSEMBLE
 static void examine_super1(struct supertype *st, char *homehost)
 {
@@ -392,6 +407,10 @@ static void examine_super1(struct supertype *st, char *homehost)
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 		printf("Internal Bitmap : %ld sectors from superblock\n",
 		       (long)(int32_t)__le32_to_cpu(sb->bitmap_offset));
+	} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+		printf("            PPL : %u sectors at offset %d sectors from superblock\n",
+		       __le16_to_cpu(sb->ppl.size),
+		       __le16_to_cpu(sb->ppl.offset));
 	}
 	if (sb->feature_map & __cpu_to_le32(MD_FEATURE_RESHAPE_ACTIVE)) {
 		printf("  Reshape pos'n : %llu%s\n", (unsigned long long)__le64_to_cpu(sb->reshape_position)/2,
@@ -934,10 +953,16 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	if (__le32_to_cpu(bsb->nodes) > 1)
 		info->array.state |= (1 << MD_SB_CLUSTERED);
 
+	super_offset = __le64_to_cpu(sb->super_offset);
 	info->data_offset = __le64_to_cpu(sb->data_offset);
 	info->component_size = __le64_to_cpu(sb->size);
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET))
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET)) {
 		info->bitmap_offset = (int32_t)__le32_to_cpu(sb->bitmap_offset);
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->ppl_offset = __le16_to_cpu(sb->ppl.offset);
+		info->ppl_size = __le16_to_cpu(sb->ppl.size);
+		info->ppl_sector = super_offset + info->ppl_offset;
+	}
 
 	info->disk.major = 0;
 	info->disk.minor = 0;
@@ -948,7 +973,6 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	else
 		role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
 
-	super_offset = __le64_to_cpu(sb->super_offset);
 	if (info->array.level <= 0)
 		data_size = __le64_to_cpu(sb->data_size);
 	else
@@ -965,8 +989,8 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 				end = bboffset;
 		}
 
-		if (super_offset + info->bitmap_offset < end)
-			end = super_offset + info->bitmap_offset;
+		if (super_offset + info->bitmap_offset + info->ppl_offset < end)
+			end = super_offset + info->bitmap_offset + info->ppl_offset;
 
 		if (info->data_offset + data_size < end)
 			info->space_after = end - data_size - info->data_offset;
@@ -982,6 +1006,11 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 			bmend += size;
 			if (bmend > earliest)
 				earliest = bmend;
+		} else if (info->ppl_offset > 0) {
+			unsigned long long pplend = info->ppl_offset +
+						    info->ppl_size;
+			if (pplend > earliest)
+				earliest = pplend;
 		}
 		if (sb->bblog_offset && sb->bblog_size) {
 			unsigned long long bbend = super_offset;
@@ -1075,8 +1104,20 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	}
 
 	info->array.working_disks = working;
-	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL))
+
+	if (sb->feature_map & __le32_to_cpu(MD_FEATURE_JOURNAL)) {
 		info->journal_device_required = 1;
+		info->consistency_policy = CONSISTENCY_POLICY_JOURNAL;
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_PPL)) {
+		info->consistency_policy = CONSISTENCY_POLICY_PPL;
+	} else if (sb->feature_map & __le32_to_cpu(MD_FEATURE_BITMAP_OFFSET)) {
+		info->consistency_policy = CONSISTENCY_POLICY_BITMAP;
+	} else if (info->array.level <= 0) {
+		info->consistency_policy = CONSISTENCY_POLICY_NONE;
+	} else {
+		info->consistency_policy = CONSISTENCY_POLICY_RESYNC;
+	}
+
 	info->journal_clean = 0;
 }
 
@@ -1239,6 +1280,9 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		if (sb->feature_map & __cpu_to_le32(MD_FEATURE_BITMAP_OFFSET)) {
 			bitmap_offset = (long)__le32_to_cpu(sb->bitmap_offset);
 			bm_sectors = calc_bitmap_size(bms, 4096) >> 9;
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bitmap_offset = (long)__le16_to_cpu(sb->ppl.offset);
+			bm_sectors = (long)__le16_to_cpu(sb->ppl.size);
 		}
 #endif
 		if (sb_offset < data_offset) {
@@ -1472,6 +1516,9 @@ static int init_super1(struct supertype *st, mdu_array_info_t *info,
 
 	memset(sb->dev_roles, 0xff, MAX_SB_SIZE - sizeof(struct mdp_superblock_1));
 
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL)
+		sb->feature_map |= __cpu_to_le32(MD_FEATURE_PPL);
+
 	return 1;
 }
 
@@ -1643,10 +1690,49 @@ static unsigned long choose_bm_space(unsigned long devsize)
 
 static void free_super1(struct supertype *st);
 
-#define META_BLOCK_SIZE 4096
+#ifndef MDASSEMBLE
+
 __u32 crc32c_le(__u32 crc, unsigned char const *p, size_t len);
 
-#ifndef MDASSEMBLE
+static int write_init_ppl1(struct supertype *st, struct mdinfo *info, int fd)
+{
+	struct mdp_superblock_1 *sb = st->sb;
+	void *buf;
+	struct ppl_header *ppl_hdr;
+	int ret;
+
+	ret = posix_memalign(&buf, 4096, PPL_HEADER_SIZE);
+	if (ret) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return ret;
+	}
+
+	memset(buf, 0, PPL_HEADER_SIZE);
+	ppl_hdr = buf;
+	memset(ppl_hdr->reserved, 0xff, PPL_HDR_RESERVED);
+	ppl_hdr->signature = __cpu_to_le32(~crc32c_le(~0, sb->set_uuid,
+						      sizeof(sb->set_uuid)));
+	ppl_hdr->checksum = __cpu_to_le32(~crc32c_le(~0, buf, PPL_HEADER_SIZE));
+
+	if (lseek64(fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		ret = errno;
+		perror("Failed to seek to PPL header location");
+	}
+
+	if (!ret && write(fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		ret = errno;
+		perror("Write PPL header failed");
+	}
+
+	if (!ret)
+		fsync(fd);
+
+	free(buf);
+	return ret;
+}
+
+#define META_BLOCK_SIZE 4096
+
 static int write_empty_r5l_meta_block(struct supertype *st, int fd)
 {
 	struct r5l_meta_block *mb;
@@ -1673,7 +1759,7 @@ static int write_empty_r5l_meta_block(struct supertype *st, int fd)
 	crc = crc32c_le(crc, (void *)mb, META_BLOCK_SIZE);
 	mb->checksum = crc;
 
-	if (lseek64(fd, (sb->data_offset) * 512, 0) < 0LL) {
+	if (lseek64(fd, __le64_to_cpu(sb->data_offset) * 512, 0) < 0LL) {
 		pr_err("cannot seek to offset of the meta block\n");
 		goto fail_to_write;
 	}
@@ -1706,7 +1792,7 @@ static int write_init_super1(struct supertype *st)
 
 	for (di = st->info; di; di = di->next) {
 		if (di->disk.state & (1 << MD_DISK_JOURNAL))
-			sb->feature_map |= MD_FEATURE_JOURNAL;
+			sb->feature_map |= __cpu_to_le32(MD_FEATURE_JOURNAL);
 	}
 
 	for (di = st->info; di; di = di->next) {
@@ -1781,6 +1867,21 @@ static int write_init_super1(struct supertype *st)
 					(((char *)sb) + MAX_SB_SIZE);
 			bm_space = calc_bitmap_size(bms, 4096) >> 9;
 			bm_offset = (long)__le32_to_cpu(sb->bitmap_offset);
+		} else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_PPL)) {
+			bm_space = choose_ppl_space(__le32_to_cpu(sb->chunksize));
+			if (bm_space > UINT16_MAX)
+				bm_space = UINT16_MAX;
+			if (st->minor_version == 0) {
+				bm_offset = -bm_space - 8;
+				if (bm_offset < INT16_MIN) {
+					bm_offset = INT16_MIN;
+					bm_space = -bm_offset - 8;
+				}
+			} else {
+				bm_offset = 8;
+			}
+			sb->ppl.offset = __cpu_to_le16(bm_offset);
+			sb->ppl.size = __cpu_to_le16(bm_space);
 		} else {
 			bm_space = choose_bm_space(array_size);
 			bm_offset = 8;
@@ -1852,8 +1953,17 @@ static int write_init_super1(struct supertype *st)
 				goto error_out;
 		}
 
-		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
+		if (rv == 0 &&
+		    (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET)) {
 			rv = st->ss->write_bitmap(st, di->fd, NodeNumUpdate);
+		} else if (rv == 0 &&
+			 (__le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)) {
+			struct mdinfo info;
+
+			st->ss->getinfo_super(st, &info, NULL);
+			rv = st->ss->write_init_ppl(st, &info, di->fd);
+		}
+
 		close(di->fd);
 		di->fd = -1;
 		if (rv)
@@ -2121,11 +2231,13 @@ static __u64 avail_size1(struct supertype *st, __u64 devsize,
 		return 0;
 
 #ifndef MDASSEMBLE
-	if (__le32_to_cpu(super->feature_map)&MD_FEATURE_BITMAP_OFFSET) {
+	if (__le32_to_cpu(super->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
 		/* hot-add. allow for actual size of bitmap */
 		struct bitmap_super_s *bsb;
 		bsb = (struct bitmap_super_s *)(((char*)super)+MAX_SB_SIZE);
 		bmspace = calc_bitmap_size(bsb, 4096) >> 9;
+	} else if (__le32_to_cpu(super->feature_map) & MD_FEATURE_PPL) {
+		bmspace = __le16_to_cpu(super->ppl.size);
 	}
 #endif
 	/* Allow space for bad block log */
@@ -2528,8 +2640,9 @@ static int validate_geometry1(struct supertype *st, int level,
 		return 0;
 	}
 
-	/* creating:  allow suitable space for bitmap */
-	bmspace = choose_bm_space(devsize);
+	/* creating:  allow suitable space for bitmap or PPL */
+	bmspace = consistency_policy == CONSISTENCY_POLICY_PPL ?
+		  choose_ppl_space((*chunk)*2) : choose_bm_space(devsize);
 
 	if (data_offset == INVALID_SECTORS)
 		data_offset = st->data_offset;
@@ -2564,7 +2677,7 @@ static int validate_geometry1(struct supertype *st, int level,
 	switch(st->minor_version) {
 	case 0: /* metadata at end.  Round down and subtract space to reserve */
 		devsize = (devsize & ~(4ULL*2-1));
-		/* space for metadata, bblog, bitmap */
+		/* space for metadata, bblog, bitmap/ppl */
 		devsize -= 8*2 + 8 + bmspace;
 		break;
 	case 1:
@@ -2640,6 +2753,7 @@ struct superswitch super1 = {
 	.add_to_super = add_to_super1,
 	.examine_badblocks = examine_badblocks_super1,
 	.copy_metadata = copy_metadata1,
+	.write_init_ppl = write_init_ppl1,
 #endif
 	.match_home = match_home1,
 	.uuid_from_super = uuid_from_super1,
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 3/6] imsm: PPL support
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

Enable creating and assembling IMSM raid5 arrays with PPL. Update the
IMSM metadata format to include new fields used for PPL.

Add structures for PPL metadata. They are used also by super1 and shared
with the kernel, so put them in md_p.h.

Write the initial empty PPL header when creating an array. When
assembling an array with PPL, validate the PPL header and in case it is
not correct allow to overwrite it if --force was provided.

Write the PPL location and size for a device to the new rdev sysfs
attributes 'ppl_sector' and 'ppl_size'. Enable PPL in the kernel by
writing to 'consistency_policy' before the array is activated.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Assemble.c    |  49 +++++++++++
 Makefile      |   5 +-
 md_p.h        |  25 ++++++
 mdadm.h       |   6 ++
 super-intel.c | 274 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 sysfs.c       |  14 +++
 6 files changed, 349 insertions(+), 24 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 3da09033..8e55b49f 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -1942,6 +1942,55 @@ int assemble_container_content(struct supertype *st, int mdfd,
 	map_update(NULL, fd2devnm(mdfd), content->text_version,
 		   content->uuid, chosen_name);
 
+	if (content->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    st->ss->validate_ppl) {
+		content->array.state |= 1;
+		err = 0;
+
+		for (dev = content->devs; dev; dev = dev->next) {
+			int dfd;
+			char *devpath;
+			int ret;
+
+			ret = st->ss->validate_ppl(st, content, dev);
+			if (ret == 0)
+				continue;
+
+			if (ret < 0) {
+				err = 1;
+				break;
+			}
+
+			if (!c->force) {
+				pr_err("%s contains invalid PPL - consider --force or --update-subarray with --update=no-ppl\n",
+					chosen_name);
+				content->array.state &= ~1;
+				avail[dev->disk.raid_disk] = 0;
+				break;
+			}
+
+			/* have --force - overwrite the invalid ppl */
+			devpath = map_dev(dev->disk.major, dev->disk.minor, 0);
+			dfd = dev_open(devpath, O_RDWR);
+			if (dfd < 0) {
+				pr_err("Failed to open %s\n", devpath);
+				err = 1;
+				break;
+			}
+
+			err = st->ss->write_init_ppl(st, content, dfd);
+			close(dfd);
+
+			if (err)
+				break;
+		}
+
+		if (err) {
+			free(avail);
+			return err;
+		}
+	}
+
 	if (enough(content->array.level, content->array.raid_disks,
 		   content->array.layout, content->array.state & 1, avail) == 0) {
 		if (c->export && result)
diff --git a/Makefile b/Makefile
index a6f464c3..49da491f 100644
--- a/Makefile
+++ b/Makefile
@@ -146,7 +146,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o util.o maps.o mdstat.o sysfs.o \
 	Kill.o sg_io.o dlink.o ReadMe.o super-intel.o \
 	super-mbr.o super-gpt.o \
 	super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \
-	platform-intel.o probe_roms.o
+	platform-intel.o probe_roms.o crc32c.o
 
 MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS))
 
@@ -156,7 +156,8 @@ STATICOBJS = pwgr.o
 ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c policy.c dlink.c util.c \
 	maps.c lib.c xmalloc.c \
 	super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
-	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c
+	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c \
+	crc32c.o
 ASSEMBLE_AUTO_SRCS := mdopen.c
 ASSEMBLE_FLAGS:= $(CFLAGS) -DMDASSEMBLE
 ifdef MDASSEMBLE_AUTO
diff --git a/md_p.h b/md_p.h
index dc9fec16..358a28ce 100644
--- a/md_p.h
+++ b/md_p.h
@@ -267,4 +267,29 @@ struct r5l_meta_block {
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
 
+struct ppl_header_entry {
+	__u64 data_sector;	/* raid sector of the new data */
+	__u32 pp_size;		/* length of partial parity */
+	__u32 data_size;	/* length of data */
+	__u32 parity_disk;	/* member disk containing parity */
+	__u32 checksum;		/* checksum of this entry's partial parity */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__u32) - sizeof(__u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8 reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
+	__u32 signature;		/* signature (family number of volume) */
+	__u32 padding;			/* zero pad */
+	__u64 generation;		/* generation number of the header */
+	__u32 entries_count;		/* number of entries in entry array */
+	__u32 checksum;			/* checksum of the header */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
diff --git a/mdadm.h b/mdadm.h
index ed4d7e4e..10c20416 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -300,6 +300,8 @@ struct mdinfo {
 		#define MaxSector  (~0ULL) /* resync/recovery complete position */
 	};
 	long			bitmap_offset;	/* 0 == none, 1 == a file */
+	unsigned int		ppl_size;
+	unsigned long long	ppl_sector;
 	unsigned long		safe_mode_delay; /* ms delay to mark clean */
 	int			new_level, delta_disks, new_layout, new_chunk;
 	int			errors;
@@ -1074,6 +1076,10 @@ extern struct superswitch {
 	/* write initial empty PPL on device */
 	int (*write_init_ppl)(struct supertype *st, struct mdinfo *info, int fd);
 
+	/* validate ppl before assemble */
+	int (*validate_ppl)(struct supertype *st, struct mdinfo *info,
+			    struct mdinfo *disk);
+
 	/* records new bad block in metadata */
 	int (*record_bad_block)(struct active_array *a, int n,
 					unsigned long long sector, int length);
diff --git a/super-intel.c b/super-intel.c
index 120ce77c..ad3a4536 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -102,6 +102,7 @@ struct imsm_disk {
 #define SPARE_DISK      __cpu_to_le32(0x01)  /* Spare */
 #define CONFIGURED_DISK __cpu_to_le32(0x02)  /* Member of some RaidDev */
 #define FAILED_DISK     __cpu_to_le32(0x04)  /* Permanent failure */
+#define JOURNAL_DISK    __cpu_to_le32(0x2000000) /* Device marked as Journaling Drive */
 	__u32 status;			 /* 0xF0 - 0xF3 */
 	__u32 owner_cfg_num; /* which config 0,1,2... owns this disk */
 	__u32 total_blocks_hi;		 /* 0xF4 - 0xF5 total blocks hi */
@@ -155,6 +156,9 @@ struct imsm_vol {
 #define MIGR_STATE_CHANGE 4
 #define MIGR_REPAIR 5
 	__u8  migr_type;	/* Initializing, Rebuilding, ... */
+#define RAIDVOL_CLEAN          0
+#define RAIDVOL_DIRTY          1
+#define RAIDVOL_DSRECORD_VALID 2
 	__u8  dirty;
 	__u8  fs_state;		/* fast-sync state for CnG (0xff == disabled) */
 	__u16 verify_errors;	/* number of mismatches */
@@ -190,7 +194,24 @@ struct imsm_dev {
 	__u16 cache_policy;
 	__u8  cng_state;
 	__u8  cng_sub_state;
-#define IMSM_DEV_FILLERS 10
+	__u16 my_vol_raid_dev_num; /* Used in Unique volume Id for this RaidDev */
+
+	/* NVM_EN */
+	__u8 nv_cache_mode;
+	__u8 nv_cache_flags;
+
+	/* Unique Volume Id of the NvCache Volume associated with this volume */
+	__u32 nvc_vol_orig_family_num;
+	__u16 nvc_vol_raid_dev_num;
+
+#define RWH_OFF 0
+#define RWH_DISTRIBUTED 1
+#define RWH_JOURNALING_DRIVE 2
+	__u8  rwh_policy; /* Raid Write Hole Policy */
+	__u8  jd_serial[MAX_RAID_SERIAL_LEN]; /* Journal Drive serial number */
+	__u8  filler1;
+
+#define IMSM_DEV_FILLERS 3
 	__u32 filler[IMSM_DEV_FILLERS];
 	struct imsm_vol vol;
 } __attribute__ ((packed));
@@ -257,6 +278,9 @@ static char *map_state_str[] = { "normal", "uninitialized", "degraded", "failed"
 #define UNIT_SRC_IN_CP_AREA 1   /* Source data for curr_migr_unit has
 				 *  already been migrated and must
 				 *  be recovered from checkpoint area */
+
+#define PPL_ENTRY_SPACE (128 * 1024) /* Size of the PPL, without the header */
+
 struct migr_record {
 	__u32 rec_status;	    /* Status used to determine how to restart
 				     * migration in case it aborts
@@ -1288,6 +1312,11 @@ static int is_failed(struct imsm_disk *disk)
 	return (disk->status & FAILED_DISK) == FAILED_DISK;
 }
 
+static int is_journal(struct imsm_disk *disk)
+{
+	return (disk->status & JOURNAL_DISK) == JOURNAL_DISK;
+}
+
 /* try to determine how much space is reserved for metadata from
  * the last get_extents() entry on the smallest active disk,
  * otherwise fallback to the default
@@ -1477,7 +1506,17 @@ static void print_imsm_dev(struct intel_super *super,
 				   blocks_per_migr_unit(super, dev));
 	}
 	printf("\n");
-	printf("    Dirty State : %s\n", dev->vol.dirty ? "dirty" : "clean");
+	printf("    Dirty State : %s\n", (dev->vol.dirty & RAIDVOL_DIRTY) ?
+					 "dirty" : "clean");
+	printf("     RWH Policy : ");
+	if (dev->rwh_policy == RWH_OFF)
+		printf("off\n");
+	else if (dev->rwh_policy == RWH_DISTRIBUTED)
+		printf("PPL distributed\n");
+	else if (dev->rwh_policy == RWH_JOURNALING_DRIVE)
+		printf("PPL journaling drive\n");
+	else
+		printf("<unknown:%d>\n", dev->rwh_policy);
 }
 
 static void print_imsm_disk(struct imsm_disk *disk,
@@ -1496,9 +1535,10 @@ static void print_imsm_disk(struct imsm_disk *disk,
 		printf("  Disk%02d Serial : %s\n", index, str);
 	else
 		printf("    Disk Serial : %s\n", str);
-	printf("          State :%s%s%s\n", is_spare(disk) ? " spare" : "",
-					    is_configured(disk) ? " active" : "",
-					    is_failed(disk) ? " failed" : "");
+	printf("          State :%s%s%s%s\n", is_spare(disk) ? " spare" : "",
+					      is_configured(disk) ? " active" : "",
+					      is_failed(disk) ? " failed" : "",
+					      is_journal(disk) ? " journal" : "");
 	printf("             Id : %08x\n", __le32_to_cpu(disk->scsi_id));
 	sz = total_blocks(disk) - reserved;
 	printf("    Usable Size : %llu%s\n",
@@ -3113,6 +3153,15 @@ static unsigned long long imsm_component_size_aligment_check(int level,
 	return component_size;
 }
 
+static unsigned long long get_ppl_sector(struct intel_super *super, int dev_idx)
+{
+	struct imsm_dev *dev = get_imsm_dev(super, dev_idx);
+	struct imsm_map *map = get_imsm_map(dev, MAP_0);
+
+	return pba_of_lba0(map) +
+	       (num_data_stripes(map) * map->blocks_per_strip);
+}
+
 static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info, char *dmap)
 {
 	struct intel_super *super = st->sb;
@@ -3139,7 +3188,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	info->array.utime	  = 0;
 	info->array.chunk_size	  =
 		__le16_to_cpu(map_to_analyse->blocks_per_strip) << 9;
-	info->array.state	  = !dev->vol.dirty;
+	info->array.state	  = !(dev->vol.dirty & RAIDVOL_DIRTY);
 	info->custom_array_size   = __le32_to_cpu(dev->size_high);
 	info->custom_array_size   <<= 32;
 	info->custom_array_size   |= __le32_to_cpu(dev->size_low);
@@ -3220,10 +3269,20 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 	memset(info->uuid, 0, sizeof(info->uuid));
 	info->recovery_start = MaxSector;
 
+	if (info->array.level == 5 && dev->rwh_policy == RWH_DISTRIBUTED) {
+		info->consistency_policy = CONSISTENCY_POLICY_PPL;
+		info->ppl_sector = get_ppl_sector(super, super->current_vol);
+		info->ppl_size = (PPL_HEADER_SIZE + PPL_ENTRY_SPACE) >> 9;
+	} else if (info->array.level <= 0) {
+		info->consistency_policy = CONSISTENCY_POLICY_NONE;
+	} else {
+		info->consistency_policy = CONSISTENCY_POLICY_RESYNC;
+	}
+
 	info->reshape_progress = 0;
 	info->resync_start = MaxSector;
 	if ((map_to_analyse->map_state == IMSM_T_STATE_UNINITIALIZED ||
-	    dev->vol.dirty) &&
+	    !(info->array.state & 1)) &&
 	    imsm_reshape_blocks_arrays_changes(super) == 0) {
 		info->resync_start = 0;
 	}
@@ -3450,7 +3509,8 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 		 * found the 'most fresh' version of the metadata
 		 */
 		info->disk.state |= is_failed(disk) ? (1 << MD_DISK_FAULTY) : 0;
-		info->disk.state |= is_spare(disk) ? 0 : (1 << MD_DISK_SYNC);
+		info->disk.state |= (is_spare(disk) || is_journal(disk)) ?
+				    0 : (1 << MD_DISK_SYNC);
 	}
 
 	/* only call uuid_from_super_imsm when this disk is part of a populated container,
@@ -3905,7 +3965,7 @@ load_imsm_disk(int fd, struct intel_super *super, char *devname, int keep_fd)
 		 */
 		if (is_failed(&dl->disk))
 			dl->index = -2;
-		else if (is_spare(&dl->disk))
+		else if (is_spare(&dl->disk) || is_journal(&dl->disk))
 			dl->index = -1;
 	}
 
@@ -5302,6 +5362,20 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	}
 	mpb->num_raid_devs++;
 
+	if (s->consistency_policy == UnSet ||
+	    s->consistency_policy == CONSISTENCY_POLICY_RESYNC ||
+	    s->consistency_policy == CONSISTENCY_POLICY_NONE) {
+		dev->rwh_policy = RWH_OFF;
+	} else if (s->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		dev->rwh_policy = RWH_DISTRIBUTED;
+	} else {
+		free(dev);
+		free(dv);
+		pr_err("imsm does not support consistency policy %s\n",
+		       map_num(consistency_policies, s->consistency_policy));
+		return 0;
+	}
+
 	dv->dev = dev;
 	dv->index = super->current_vol;
 	dv->next = super->devlist;
@@ -5926,11 +6000,146 @@ static int mgmt_disk(struct supertype *st)
 
 	return 0;
 }
+#endif
+
+__u32 crc32c_le(__u32 crc, unsigned char const *p, size_t len);
+
+static int write_init_ppl_imsm(struct supertype *st, struct mdinfo *info, int fd)
+{
+	struct intel_super *super = st->sb;
+	void *buf;
+	struct ppl_header *ppl_hdr;
+	int ret;
+
+	ret = posix_memalign(&buf, 4096, PPL_HEADER_SIZE);
+	if (ret) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return ret;
+	}
+
+	memset(buf, 0, PPL_HEADER_SIZE);
+	ppl_hdr = buf;
+	memset(ppl_hdr->reserved, 0xff, PPL_HDR_RESERVED);
+	ppl_hdr->signature = __cpu_to_le32(super->anchor->orig_family_num);
+	ppl_hdr->checksum = __cpu_to_le32(~crc32c_le(~0, buf, PPL_HEADER_SIZE));
+
+	if (lseek64(fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		ret = errno;
+		perror("Failed to seek to PPL header location");
+	}
+
+	if (!ret && write(fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		ret = errno;
+		perror("Write PPL header failed");
+	}
+
+	if (!ret)
+		fsync(fd);
+
+	free(buf);
+	return ret;
+}
+
+static int validate_ppl_imsm(struct supertype *st, struct mdinfo *info,
+			     struct mdinfo *disk)
+{
+	struct intel_super *super = st->sb;
+	struct dl *d;
+	void *buf;
+	int ret = 0;
+	struct ppl_header *ppl_hdr;
+	__u32 crc;
+	struct imsm_dev *dev;
+	struct imsm_map *map;
+	__u32 idx;
+
+	if (disk->disk.raid_disk < 0)
+		return 0;
+
+	if (posix_memalign(&buf, 4096, PPL_HEADER_SIZE)) {
+		pr_err("Failed to allocate PPL header buffer\n");
+		return -1;
+	}
+
+	dev = get_imsm_dev(super, info->container_member);
+	map = get_imsm_map(dev, MAP_X);
+	idx = get_imsm_disk_idx(dev, disk->disk.raid_disk, MAP_X);
+	d = get_imsm_dl_disk(super, idx);
+
+	if (!d || d->index < 0 || is_failed(&d->disk))
+		goto out;
+
+	if (lseek64(d->fd, info->ppl_sector * 512, SEEK_SET) < 0) {
+		perror("Failed to seek to PPL header location");
+		ret = -1;
+		goto out;
+	}
+
+	if (read(d->fd, buf, PPL_HEADER_SIZE) != PPL_HEADER_SIZE) {
+		perror("Read PPL header failed");
+		ret = -1;
+		goto out;
+	}
+
+	ppl_hdr = buf;
+
+	crc = __le32_to_cpu(ppl_hdr->checksum);
+	ppl_hdr->checksum = 0;
+
+	if (crc != ~crc32c_le(~0, buf, PPL_HEADER_SIZE)) {
+		dprintf("Wrong PPL header checksum on %s\n",
+			d->devname);
+		ret = 1;
+	}
+
+	if (!ret && (__le32_to_cpu(ppl_hdr->signature) !=
+		      super->anchor->orig_family_num)) {
+		dprintf("Wrong PPL header signature on %s\n",
+			d->devname);
+		ret = 1;
+	}
+
+out:
+	free(buf);
+
+	if (ret == 1 && map->map_state == IMSM_T_STATE_UNINITIALIZED)
+		return st->ss->write_init_ppl(st, info, d->fd);
+
+	return ret;
+}
+
+#ifndef MDASSEMBLE
+
+static int write_init_ppl_imsm_all(struct supertype *st, struct mdinfo *info)
+{
+	struct intel_super *super = st->sb;
+	struct dl *d;
+	int ret = 0;
+
+	if (info->consistency_policy != CONSISTENCY_POLICY_PPL ||
+	    info->array.level != 5)
+		return 0;
+
+	for (d = super->disks; d ; d = d->next) {
+		if (d->index < 0 || is_failed(&d->disk))
+			continue;
+
+		ret = st->ss->write_init_ppl(st, info, d->fd);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
 
 static int write_init_super_imsm(struct supertype *st)
 {
 	struct intel_super *super = st->sb;
 	int current_vol = super->current_vol;
+	int rv = 0;
+	struct mdinfo info;
+
+	getinfo_super_imsm(st, &info, NULL);
 
 	/* we are done with current_vol reset it to point st at the container */
 	super->current_vol = -1;
@@ -5938,24 +6147,29 @@ static int write_init_super_imsm(struct supertype *st)
 	if (st->update_tail) {
 		/* queue the recently created array / added disk
 		 * as a metadata update */
-		int rv;
 
 		/* determine if we are creating a volume or adding a disk */
 		if (current_vol < 0) {
 			/* in the mgmt (add/remove) disk case we are running
 			 * in mdmon context, so don't close fd's
 			 */
-			return mgmt_disk(st);
-		} else
-			rv = create_array(st, current_vol);
-
-		return rv;
+			rv = mgmt_disk(st);
+		} else {
+			rv = write_init_ppl_imsm_all(st, &info);
+			if (!rv)
+				rv = create_array(st, current_vol);
+		}
 	} else {
 		struct dl *d;
 		for (d = super->disks; d; d = d->next)
 			Kill(d->devname, NULL, 0, -1, 1);
-		return write_super_imsm(st, 1);
+		if (current_vol >= 0)
+			rv = write_init_ppl_imsm_all(st, &info);
+		if (!rv)
+			rv = write_super_imsm(st, 1);
 	}
+
+	return rv;
 }
 #endif
 
@@ -7372,7 +7586,8 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 			 *
 			 * FIXME handle dirty degraded
 			 */
-			if ((skip || recovery_start == 0) && !dev->vol.dirty)
+			if ((skip || recovery_start == 0) &&
+			    !(dev->vol.dirty & RAIDVOL_DIRTY))
 				this->resync_start = MaxSector;
 			if (skip)
 				continue;
@@ -7407,9 +7622,12 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 				info_d->component_size =
 						num_data_stripes(map) *
 						map->blocks_per_strip;
+				info_d->ppl_sector = this->ppl_sector;
+				info_d->ppl_size = this->ppl_size;
 			} else {
 				info_d->component_size = blocks_per_member(map);
 			}
+			info_d->consistency_policy = this->consistency_policy;
 
 			info_d->bb.supported = 1;
 			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
@@ -7925,12 +8143,16 @@ mark_checkpoint:
 
 skip_mark_checkpoint:
 	/* mark dirty / clean */
-	if (dev->vol.dirty != !consistent) {
+	if (((dev->vol.dirty & RAIDVOL_DIRTY) && consistent) ||
+	    (!(dev->vol.dirty & RAIDVOL_DIRTY) && !consistent)) {
 		dprintf("imsm: mark '%s'\n", consistent ? "clean" : "dirty");
-		if (consistent)
-			dev->vol.dirty = 0;
-		else
-			dev->vol.dirty = 1;
+		if (consistent) {
+			dev->vol.dirty = RAIDVOL_CLEAN;
+		} else {
+			dev->vol.dirty = RAIDVOL_DIRTY;
+			if (dev->rwh_policy == RWH_DISTRIBUTED)
+				dev->vol.dirty |= RAIDVOL_DSRECORD_VALID;
+		}
 		super->updates_pending++;
 	}
 
@@ -8442,6 +8664,11 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
 		di->component_size = a->info.component_size;
 		di->container_member = inst;
 		di->bb.supported = 1;
+		if (dev->rwh_policy == RWH_DISTRIBUTED) {
+			di->consistency_policy = CONSISTENCY_POLICY_PPL;
+			di->ppl_sector = get_ppl_sector(super, inst);
+			di->ppl_size = (PPL_HEADER_SIZE + PPL_ENTRY_SPACE) >> 9;
+		}
 		super->random = random32();
 		di->next = rv;
 		rv = di;
@@ -11597,6 +11824,9 @@ struct superswitch super_imsm = {
 	.container_content = container_content_imsm,
 	.validate_container = validate_container_imsm,
 
+	.write_init_ppl = write_init_ppl_imsm,
+	.validate_ppl	= validate_ppl_imsm,
+
 	.external	= 1,
 	.name = "imsm",
 
diff --git a/sysfs.c b/sysfs.c
index 53589a76..2a91ba0a 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -689,6 +689,16 @@ int sysfs_set_array(struct mdinfo *info, int vers)
 		 * once the reshape completes.
 		 */
 	}
+
+	if (info->consistency_policy == CONSISTENCY_POLICY_PPL) {
+		if (sysfs_set_str(info, NULL, "consistency_policy",
+				  map_num(consistency_policies,
+					  info->consistency_policy))) {
+			pr_err("This kernel does not support PPL\n");
+			return 1;
+		}
+	}
+
 	return rv;
 }
 
@@ -720,6 +730,10 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 	rv = sysfs_set_num(sra, sd, "offset", sd->data_offset);
 	rv |= sysfs_set_num(sra, sd, "size", (sd->component_size+1) / 2);
 	if (sra->array.level != LEVEL_CONTAINER) {
+		if (sd->consistency_policy == CONSISTENCY_POLICY_PPL) {
+			rv |= sysfs_set_num(sra, sd, "ppl_sector", sd->ppl_sector);
+			rv |= sysfs_set_num(sra, sd, "ppl_size", sd->ppl_size);
+		}
 		if (sd->recovery_start == MaxSector)
 			/* This can correctly fail if array isn't started,
 			 * yet, so just ignore status for now.
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 2/6] Detail: show consistency policy
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

Show the currently enabled consistency policy in the output from
--detail. Add 3 spaces to all existing items in Detail output to align
with "Consistency Policy : ".

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Detail.c      | 90 +++++++++++++++++++++++++++++++++++------------------------
 super-ddf.c   |  6 ++--
 super-intel.c |  2 +-
 super0.c      |  4 +--
 super1.c      |  9 +++---
 5 files changed, 65 insertions(+), 46 deletions(-)

diff --git a/Detail.c b/Detail.c
index 509b0d41..3067fb69 100644
--- a/Detail.c
+++ b/Detail.c
@@ -394,24 +394,25 @@ int Detail(char *dev, struct context *c)
 		printf("%s:\n", dev);
 
 		if (container)
-			printf("      Container : %s, member %s\n", container, member);
+			printf("         Container : %s, member %s\n", container,
+			       member);
 		else {
 		if (sra && sra->array.major_version < 0)
-			printf("        Version : %s\n", sra->text_version);
+			printf("           Version : %s\n", sra->text_version);
 		else
-			printf("        Version : %d.%d\n",
+			printf("           Version : %d.%d\n",
 			       array.major_version, array.minor_version);
 		}
 
 		atime = array.ctime;
 		if (atime)
-			printf("  Creation Time : %.24s\n", ctime(&atime));
+			printf("     Creation Time : %.24s\n", ctime(&atime));
 		if (array.raid_disks == 0 && external)
 			str = "container";
 		if (str)
-			printf("     Raid Level : %s\n", str);
+			printf("        Raid Level : %s\n", str);
 		if (larray_size)
-			printf("     Array Size : %llu%s\n", (larray_size>>10),
+			printf("        Array Size : %llu%s\n", (larray_size>>10),
 			       human_size(larray_size));
 		if (array.level >= 1) {
 			if (sra)
@@ -420,38 +421,38 @@ int Detail(char *dev, struct context *c)
 			    (larray_size >= 0xFFFFFFFFULL|| array.size == 0)) {
 				unsigned long long dsize = get_component_size(fd);
 				if (dsize > 0)
-					printf("  Used Dev Size : %llu%s\n",
+					printf("     Used Dev Size : %llu%s\n",
 					       dsize/2,
 					 human_size((long long)dsize<<9));
 				else
-					printf("  Used Dev Size : unknown\n");
+					printf("     Used Dev Size : unknown\n");
 			} else
-				printf("  Used Dev Size : %lu%s\n",
+				printf("     Used Dev Size : %lu%s\n",
 				       (unsigned long)array.size,
 				       human_size((unsigned long long)array.size<<10));
 		}
 		if (array.raid_disks)
-			printf("   Raid Devices : %d\n", array.raid_disks);
-		printf("  Total Devices : %d\n", array.nr_disks);
+			printf("      Raid Devices : %d\n", array.raid_disks);
+		printf("     Total Devices : %d\n", array.nr_disks);
 		if (!container &&
 		    ((sra == NULL && array.major_version == 0) ||
 		     (sra && sra->array.major_version == 0)))
-			printf("Preferred Minor : %d\n", array.md_minor);
+			printf("   Preferred Minor : %d\n", array.md_minor);
 		if (sra == NULL || sra->array.major_version >= 0)
-			printf("    Persistence : Superblock is %spersistent\n",
+			printf("       Persistence : Superblock is %spersistent\n",
 			       array.not_persistent?"not ":"");
 		printf("\n");
 		/* Only try GET_BITMAP_FILE for 0.90.01 and later */
 		if (vers >= 9001 &&
 		    ioctl(fd, GET_BITMAP_FILE, &bmf) == 0 &&
 		    bmf.pathname[0]) {
-			printf("  Intent Bitmap : %s\n", bmf.pathname);
+			printf("     Intent Bitmap : %s\n", bmf.pathname);
 			printf("\n");
 		} else if (array.state & (1<<MD_SB_BITMAP_PRESENT))
-			printf("  Intent Bitmap : Internal\n\n");
+			printf("     Intent Bitmap : Internal\n\n");
 		atime = array.utime;
 		if (atime)
-			printf("    Update Time : %.24s\n", ctime(&atime));
+			printf("       Update Time : %.24s\n", ctime(&atime));
 		if (array.raid_disks) {
 			static char *sync_action[] = {
 				", recovering",  ", resyncing",
@@ -465,7 +466,7 @@ int Detail(char *dev, struct context *c)
 			else
 				st = ", degraded";
 
-			printf("          State : %s%s%s%s%s%s \n",
+			printf("             State : %s%s%s%s%s%s \n",
 			       (array.state&(1<<MD_SB_CLEAN))?"clean":"active", st,
 			       (!e || (e->percent < 0 && e->percent != RESYNC_PENDING &&
 			       e->percent != RESYNC_DELAYED)) ? "" : sync_action[e->resync],
@@ -473,27 +474,27 @@ int Detail(char *dev, struct context *c)
 			       (e && e->percent == RESYNC_DELAYED) ? " (DELAYED)": "",
 			       (e && e->percent == RESYNC_PENDING) ? " (PENDING)": "");
 		} else if (inactive) {
-			printf("          State : inactive\n");
+			printf("             State : inactive\n");
 		}
 		if (array.raid_disks)
-			printf(" Active Devices : %d\n", array.active_disks);
+			printf("    Active Devices : %d\n", array.active_disks);
 		if (array.working_disks > 0)
-			printf("Working Devices : %d\n", array.working_disks);
+			printf("   Working Devices : %d\n", array.working_disks);
 		if (array.raid_disks) {
-			printf(" Failed Devices : %d\n", array.failed_disks);
-			printf("  Spare Devices : %d\n", array.spare_disks);
+			printf("    Failed Devices : %d\n", array.failed_disks);
+			printf("     Spare Devices : %d\n", array.spare_disks);
 		}
 		printf("\n");
 		if (array.level == 5) {
 			str = map_num(r5layout, array.layout);
-			printf("         Layout : %s\n", str?str:"-unknown-");
+			printf("            Layout : %s\n", str?str:"-unknown-");
 		}
 		if (array.level == 6) {
 			str = map_num(r6layout, array.layout);
-			printf("         Layout : %s\n", str?str:"-unknown-");
+			printf("            Layout : %s\n", str?str:"-unknown-");
 		}
 		if (array.level == 10) {
-			printf("         Layout :");
+			printf("            Layout :");
 			print_r10_layout(array.layout);
 			printf("\n");
 		}
@@ -504,20 +505,35 @@ int Detail(char *dev, struct context *c)
 		case 10:
 		case 6:
 			if (array.chunk_size)
-				printf("     Chunk Size : %dK\n\n",
+				printf("        Chunk Size : %dK\n\n",
 				       array.chunk_size/1024);
 			break;
 		case -1:
-			printf("       Rounding : %dK\n\n", array.chunk_size/1024);
+			printf("          Rounding : %dK\n\n",
+			       array.chunk_size/1024);
 			break;
 		default: break;
 		}
 
+		if (array.raid_disks) {
+			struct mdinfo *mdi = sysfs_read(fd, NULL,
+							GET_CONSISTENCY_POLICY);
+			if (mdi) {
+				char *policy = map_num(consistency_policies,
+						       mdi->consistency_policy);
+				sysfs_free(mdi);
+				if (policy)
+					printf("Consistency Policy : %s\n\n",
+					       policy);
+			}
+		}
+
 		if (e && e->percent >= 0) {
 			static char *sync_action[] = {
 				"Rebuild", "Resync",
 				"Reshape", "Check"};
-			printf(" %7s Status : %d%% complete\n", sync_action[e->resync], e->percent);
+			printf("    %7s Status : %d%% complete\n",
+			       sync_action[e->resync], e->percent);
 			is_rebuilding = 1;
 		}
 		free_mdstat(ms);
@@ -525,39 +541,41 @@ int Detail(char *dev, struct context *c)
 		if ((st && st->sb) && (info && info->reshape_active)) {
 #if 0
 This is pretty boring
-			printf("  Reshape pos'n : %llu%s\n", (unsigned long long) info->reshape_progress<<9,
+			printf("     Reshape pos'n : %llu%s\n",
+			       (unsigned long long) info->reshape_progress<<9,
 			       human_size((unsigned long long)info->reshape_progress<<9));
 #endif
 			if (info->delta_disks != 0)
-				printf("  Delta Devices : %d, (%d->%d)\n",
+				printf("     Delta Devices : %d, (%d->%d)\n",
 				       info->delta_disks,
 				       array.raid_disks - info->delta_disks,
 				       array.raid_disks);
 			if (info->new_level != array.level) {
 				str = map_num(pers, info->new_level);
-				printf("      New Level : %s\n", str?str:"-unknown-");
+				printf("         New Level : %s\n", str?str:"-unknown-");
 			}
 			if (info->new_level != array.level ||
 			    info->new_layout != array.layout) {
 				if (info->new_level == 5) {
 					str = map_num(r5layout, info->new_layout);
-					printf("     New Layout : %s\n",
+					printf("        New Layout : %s\n",
 					       str?str:"-unknown-");
 				}
 				if (info->new_level == 6) {
 					str = map_num(r6layout, info->new_layout);
-					printf("     New Layout : %s\n",
+					printf("        New Layout : %s\n",
 					       str?str:"-unknown-");
 				}
 				if (info->new_level == 10) {
-					printf("     New Layout : near=%d, %s=%d\n",
+					printf("        New Layout : near=%d, %s=%d\n",
 					       info->new_layout&255,
 					       (info->new_layout&0x10000)?"offset":"far",
 					       (info->new_layout>>8)&255);
 				}
 			}
 			if (info->new_chunk != array.chunk_size)
-				printf("  New Chunksize : %dK\n", info->new_chunk/1024);
+				printf("     New Chunksize : %dK\n",
+				       info->new_chunk/1024);
 			printf("\n");
 		} else if (e && e->percent >= 0)
 			printf("\n");
@@ -572,7 +590,7 @@ This is pretty boring
 			DIR *dir = opendir("/sys/block");
 			struct dirent *de;
 
-			printf("  Member Arrays :");
+			printf("     Member Arrays :");
 
 			while (dir && (de = readdir(dir)) != NULL) {
 				char path[200];
diff --git a/super-ddf.c b/super-ddf.c
index cdd16a47..c6037c1c 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -1742,10 +1742,10 @@ static void detail_super_ddf(struct supertype *st, char *homehost)
 	struct ddf_super *sb = st->sb;
 	int cnt = be16_to_cpu(sb->virt->populated_vdes);
 
-	printf(" Container GUID : "); print_guid(sb->anchor.guid, 1);
+	printf("    Container GUID : "); print_guid(sb->anchor.guid, 1);
 	printf("\n");
-	printf("            Seq : %08x\n", be32_to_cpu(sb->active->seq));
-	printf("  Virtual Disks : %d\n", cnt);
+	printf("               Seq : %08x\n", be32_to_cpu(sb->active->seq));
+	printf("     Virtual Disks : %d\n", cnt);
 	printf("\n");
 }
 #endif
diff --git a/super-intel.c b/super-intel.c
index 6d16a191..120ce77c 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -1986,7 +1986,7 @@ static void detail_super_imsm(struct supertype *st, char *homehost)
 
 	getinfo_super_imsm(st, &info, NULL);
 	fname_from_uuid(st, &info, nbuf, ':');
-	printf("\n           UUID : %s\n", nbuf + 5);
+	printf("\n              UUID : %s\n", nbuf + 5);
 }
 
 static void brief_detail_super_imsm(struct supertype *st)
diff --git a/super0.c b/super0.c
index 0fec96b7..b2cdaec5 100644
--- a/super0.c
+++ b/super0.c
@@ -353,7 +353,7 @@ err:
 static void detail_super0(struct supertype *st, char *homehost)
 {
 	mdp_super_t *sb = st->sb;
-	printf("           UUID : ");
+	printf("              UUID : ");
 	if (sb->minor_version >= 90)
 		printf("%08x:%08x:%08x:%08x", sb->set_uuid0, sb->set_uuid1,
 		       sb->set_uuid2, sb->set_uuid3);
@@ -367,7 +367,7 @@ static void detail_super0(struct supertype *st, char *homehost)
 		if (memcmp(&sb->set_uuid2, hash, 8)==0)
 			printf(" (local to host %s)", homehost);
 	}
-	printf("\n         Events : %d.%d\n\n", sb->events_hi, sb->events_lo);
+	printf("\n            Events : %d.%d\n\n", sb->events_hi, sb->events_lo);
 }
 
 static void brief_detail_super0(struct supertype *st)
diff --git a/super1.c b/super1.c
index fa238329..672cdde6 100644
--- a/super1.c
+++ b/super1.c
@@ -780,19 +780,20 @@ static void detail_super1(struct supertype *st, char *homehost)
 	int i;
 	int l = homehost ? strlen(homehost) : 0;
 
-	printf("           Name : %.32s", sb->set_name);
+	printf("              Name : %.32s", sb->set_name);
 	if (l > 0 && l < 32 &&
 	    sb->set_name[l] == ':' &&
 	    strncmp(sb->set_name, homehost, l) == 0)
 		printf("  (local to host %s)", homehost);
 	if (bms->nodes > 0 && (__le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET))
-	    printf("\n   Cluster Name : %-64s", bms->cluster_name);
-	printf("\n           UUID : ");
+	    printf("\n      Cluster Name : %-64s", bms->cluster_name);
+	printf("\n              UUID : ");
 	for (i=0; i<16; i++) {
 		if ((i&3)==0 && i != 0) printf(":");
 		printf("%02x", sb->set_uuid[i]);
 	}
-	printf("\n         Events : %llu\n\n", (unsigned long long)__le64_to_cpu(sb->events));
+	printf("\n            Events : %llu\n\n",
+	       (unsigned long long)__le64_to_cpu(sb->events));
 }
 
 static void brief_detail_super1(struct supertype *st)
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 1/6] Generic support for --consistency-policy and PPL
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170316210948.21093-1-artur.paszkiewicz@intel.com>

Add a new parameter to mdadm: --consistency-policy=. It determines how
the array maintains consistency in case of unexpected shutdown. This
maps to the md sysfs attribute 'consistency_policy'. It can be used to
create a raid5 array using PPL. Add the necessary plumbing to pass this
option to metadata handlers. The write journal and bitmap
functionalities are treated as different policies, which are implicitly
selected when using --write-journal or --bitmap options.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Create.c      | 18 ++++++++++++++----
 Kill.c        |  2 +-
 ReadMe.c      |  7 ++++---
 maps.c        | 10 ++++++++++
 mdadm.8.in    | 40 +++++++++++++++++++++++++++++++++++++---
 mdadm.c       | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 mdadm.h       | 21 ++++++++++++++++++---
 super-ddf.c   |  6 +++---
 super-gpt.c   |  2 +-
 super-intel.c | 16 ++++++++--------
 super-mbr.c   |  2 +-
 super0.c      |  8 ++++----
 super1.c      |  6 +++---
 sysfs.c       | 11 +++++++++++
 14 files changed, 167 insertions(+), 37 deletions(-)

diff --git a/Create.c b/Create.c
index 2721884e..4080bf69 100644
--- a/Create.c
+++ b/Create.c
@@ -259,7 +259,8 @@ int Create(struct supertype *st, char *mddev,
 	if (st && ! st->ss->validate_geometry(st, s->level, s->layout, s->raiddisks,
 					      &s->chunk, s->size*2,
 					      data_offset, NULL,
-					      &newsize, c->verbose>=0))
+					      &newsize, s->consistency_policy,
+					      c->verbose>=0))
 		return 1;
 
 	if (s->chunk && s->chunk != UnSet) {
@@ -358,7 +359,8 @@ int Create(struct supertype *st, char *mddev,
 						st, s->level, s->layout, s->raiddisks,
 						&s->chunk, s->size*2,
 						dv->data_offset, dname,
-						&freesize, c->verbose > 0)) {
+						&freesize, s->consistency_policy,
+						c->verbose > 0)) {
 				case -1: /* Not valid, message printed, and not
 					  * worth checking any further */
 					exit(2);
@@ -395,6 +397,7 @@ int Create(struct supertype *st, char *mddev,
 						       &s->chunk, s->size*2,
 						       dv->data_offset,
 						       dname, &freesize,
+						       s->consistency_policy,
 						       c->verbose >= 0)) {
 
 				pr_err("%s is not suitable for this array.\n",
@@ -501,7 +504,8 @@ int Create(struct supertype *st, char *mddev,
 						       s->raiddisks,
 						       &s->chunk, minsize*2,
 						       data_offset,
-						       NULL, NULL, 0)) {
+						       NULL, NULL,
+						       s->consistency_policy, 0)) {
 				pr_err("devices too large for RAID level %d\n", s->level);
 				return 1;
 			}
@@ -528,6 +532,12 @@ int Create(struct supertype *st, char *mddev,
 	if (s->bitmap_file && strcmp(s->bitmap_file, "none") == 0)
 		s->bitmap_file = NULL;
 
+	if (s->consistency_policy == CONSISTENCY_POLICY_PPL &&
+	    !st->ss->write_init_ppl) {
+		pr_err("%s metadata does not support PPL\n", st->ss->name);
+		return 1;
+	}
+
 	if (!have_container && s->level > 0 && ((maxsize-s->size)*100 > maxsize)) {
 		if (c->runstop != 1 || c->verbose >= 0)
 			pr_err("largest drive (%s) exceeds size (%lluK) by more than 1%%\n",
@@ -720,7 +730,7 @@ int Create(struct supertype *st, char *mddev,
 				name += 2;
 		}
 	}
-	if (!st->ss->init_super(st, &info.array, s->size, name, c->homehost, uuid,
+	if (!st->ss->init_super(st, &info.array, s, name, c->homehost, uuid,
 				data_offset))
 		goto abort_locked;
 
diff --git a/Kill.c b/Kill.c
index f2fdb856..ff52561d 100644
--- a/Kill.c
+++ b/Kill.c
@@ -63,7 +63,7 @@ int Kill(char *dev, struct supertype *st, int force, int verbose, int noexcl)
 	rv = st->ss->load_super(st, fd, dev);
 	if (rv == 0 || (force && rv >= 2)) {
 		st->ss->free_super(st);
-		st->ss->init_super(st, NULL, 0, "", NULL, NULL,
+		st->ss->init_super(st, NULL, NULL, "", NULL, NULL,
 				   INVALID_SECTORS);
 		if (st->ss->store_super(st, fd)) {
 			if (verbose >= 0)
diff --git a/ReadMe.c b/ReadMe.c
index 50d38076..a2acfc74 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -78,11 +78,11 @@ char Version[] = "mdadm - v" VERSION " - " VERS_DATE "\n";
  *     found, it is started.
  */
 
-char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:";
+char short_options[]="-ABCDEFGIQhVXYWZ:vqbc:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_bitmap_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sarfRSow1tye:k:";
 char short_bitmap_auto_options[]=
-		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:";
+		"-ABCDEFGIQhVXYWZ:vqb:c:i:l:p:m:n:x:u:c:d:z:U:N:sa:rfRSow1tye:k:";
 
 struct option long_options[] = {
     {"manage",    0, 0, ManageOpt},
@@ -148,6 +148,7 @@ struct option long_options[] = {
     {"nodes",1, 0, Nodes}, /* also for --assemble */
     {"home-cluster",1, 0, ClusterName},
     {"write-journal",1, 0, WriteJournal},
+    {"consistency-policy",1, 0, 'k'},
 
     /* For assemble */
     {"uuid",      1, 0, 'u'},
diff --git a/maps.c b/maps.c
index 64f1df2c..d9ee7de4 100644
--- a/maps.c
+++ b/maps.c
@@ -129,6 +129,16 @@ mapping_t faultylayout[] = {
 	{ NULL, 0}
 };
 
+mapping_t consistency_policies[] = {
+	{ "unknown", CONSISTENCY_POLICY_UNKNOWN},
+	{ "none", CONSISTENCY_POLICY_NONE},
+	{ "resync", CONSISTENCY_POLICY_RESYNC},
+	{ "bitmap", CONSISTENCY_POLICY_BITMAP},
+	{ "journal", CONSISTENCY_POLICY_JOURNAL},
+	{ "ppl", CONSISTENCY_POLICY_PPL},
+	{ NULL, 0}
+};
+
 char *map_num(mapping_t *map, int num)
 {
 	while (map->name) {
diff --git a/mdadm.8.in b/mdadm.8.in
index df1d460d..cad5db53 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -724,7 +724,9 @@ When creating an array on devices which are 100G or larger,
 .I mdadm
 automatically adds an internal bitmap as it will usually be
 beneficial.  This can be suppressed with
-.B "\-\-bitmap=none".
+.B "\-\-bitmap=none"
+or by selecting a different consistency policy with
+.BR \-\-consistency\-policy .
 
 .TP
 .BR \-\-bitmap\-chunk=
@@ -1020,6 +1022,36 @@ should be a SSD with reasonable lifetime.
 Auto creation of symlinks in /dev to /dev/md, option --symlinks must
 be 'no' or 'yes' and work with --create and --build.
 
+.TP
+.BR \-k ", " \-\-consistency\-policy=
+Specify how the array maintains consistency in case of unexpected shutdown.
+Only relevant for RAID levels with redundancy.
+Currently supported options are:
+.RS
+
+.TP
+.B resync
+Full resync is performed and all redundancy is regenerated when the array is
+started after unclean shutdown.
+
+.TP
+.B bitmap
+Resync assisted by a write-intent bitmap. Implicitly selected when using
+.BR \-\-bitmap .
+
+.TP
+.B journal
+For RAID levels 4/5/6, journal device is used to log transactions and replay
+after unclean shutdown. Implicitly selected when using
+.BR \-\-write\-journal .
+
+.TP
+.B ppl
+For RAID5 only, Partial Parity Log is used to close the write hole and
+eliminate resync. PPL is stored in the metadata region of RAID member drives,
+no additional journal drive is needed.
+.RE
+
 
 .SH For assemble:
 
@@ -2153,8 +2185,10 @@ in the array exceed 100G is size, an internal write-intent bitmap
 will automatically be added unless some other option is explicitly
 requested with the
 .B \-\-bitmap
-option.  In any case space for a bitmap will be reserved so that one
-can be added layer with
+option or a different consistency policy is selected with the
+.B \-\-consistency\-policy
+option. In any case space for a bitmap will be reserved so that one
+can be added later with
 .BR "\-\-grow \-\-bitmap=internal" .
 
 If the metadata type supports it (currently only 1.x metadata), space
diff --git a/mdadm.c b/mdadm.c
index d6ad8dc2..65431b76 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -78,6 +78,7 @@ int main(int argc, char *argv[])
 		.level		= UnSet,
 		.layout		= UnSet,
 		.bitmap_chunk	= UnSet,
+		.consistency_policy	= UnSet,
 	};
 
 	char sys_hostname[256];
@@ -1211,6 +1212,16 @@ int main(int argc, char *argv[])
 
 			s.journaldisks = 1;
 			continue;
+		case O(CREATE, 'k'):
+			s.consistency_policy = map_name(consistency_policies,
+							optarg);
+			if (s.consistency_policy == UnSet ||
+			    s.consistency_policy < CONSISTENCY_POLICY_RESYNC) {
+				pr_err("Invalid consistency policy: %s\n",
+				       optarg);
+				exit(2);
+			}
+			continue;
 		}
 		/* We have now processed all the valid options. Anything else is
 		 * an error
@@ -1238,9 +1249,47 @@ int main(int argc, char *argv[])
 		exit(0);
 	}
 
-	if (s.journaldisks && (s.level < 4 || s.level > 6)) {
-		pr_err("--write-journal is only supported for RAID level 4/5/6.\n");
-		exit(2);
+	if (s.journaldisks) {
+		if (s.level < 4 || s.level > 6) {
+			pr_err("--write-journal is only supported for RAID level 4/5/6.\n");
+			exit(2);
+		}
+		if (s.consistency_policy != UnSet &&
+		    s.consistency_policy != CONSISTENCY_POLICY_JOURNAL) {
+			pr_err("--write-journal is not supported with consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		}
+	}
+
+	if (mode == CREATE && s.consistency_policy != UnSet) {
+		if (s.level <= 0) {
+			pr_err("--consistency-policy not meaningful with level %s.\n",
+			       map_num(pers, s.level));
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_JOURNAL &&
+			   !s.journaldisks) {
+			pr_err("--write-journal is required for consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_PPL &&
+			   s.level != 5) {
+			pr_err("PPL consistency policy is only supported for RAID level 5.\n");
+			exit(2);
+		} else if (s.consistency_policy == CONSISTENCY_POLICY_BITMAP &&
+			   (!s.bitmap_file ||
+			    strcmp(s.bitmap_file, "none") == 0)) {
+			pr_err("--bitmap is required for consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		} else if (s.bitmap_file &&
+			   strcmp(s.bitmap_file, "none") != 0 &&
+			   s.consistency_policy != CONSISTENCY_POLICY_BITMAP &&
+			   s.consistency_policy != CONSISTENCY_POLICY_JOURNAL) {
+			pr_err("--bitmap is not compatible with consistency policy: %s\n",
+			       map_num(consistency_policies, s.consistency_policy));
+			exit(2);
+		}
 	}
 
 	if (!mode && devs_found) {
diff --git a/mdadm.h b/mdadm.h
index 71b8afb9..ed4d7e4e 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -279,6 +279,15 @@ struct mdinfo {
 	int journal_device_required;
 	int journal_clean;
 
+	enum {
+		CONSISTENCY_POLICY_UNKNOWN,
+		CONSISTENCY_POLICY_NONE,
+		CONSISTENCY_POLICY_RESYNC,
+		CONSISTENCY_POLICY_BITMAP,
+		CONSISTENCY_POLICY_JOURNAL,
+		CONSISTENCY_POLICY_PPL,
+	} consistency_policy;
+
 	/* During reshape we can sometimes change the data_offset to avoid
 	 * over-writing still-valid data.  We need to know if there is space.
 	 * So getinfo_super will fill in space_before and space_after in sectors.
@@ -426,6 +435,7 @@ enum special_options {
 	ClusterName,
 	ClusterConfirm,
 	WriteJournal,
+	ConsistencyPolicy,
 };
 
 enum prefix_standard {
@@ -527,6 +537,7 @@ struct shape {
 	int	assume_clean;
 	int	write_behind;
 	unsigned long long size;
+	int	consistency_policy;
 };
 
 /* List of device names - wildcards expanded */
@@ -618,6 +629,7 @@ enum sysfs_read_flags {
 	GET_STATE	= (1 << 23),
 	GET_ERROR	= (1 << 24),
 	GET_ARRAY_STATE = (1 << 25),
+	GET_CONSISTENCY_POLICY	= (1 << 26),
 };
 
 /* If fd >= 0, get the array it is open on,
@@ -701,7 +713,7 @@ extern int restore_stripes(int *dest, unsigned long long *offsets,
 
 extern char *map_num(mapping_t *map, int num);
 extern int map_name(mapping_t *map, char *name);
-extern mapping_t r5layout[], r6layout[], pers[], modes[], faultylayout[];
+extern mapping_t r5layout[], r6layout[], pers[], modes[], faultylayout[], consistency_policies[];
 
 extern char *map_dev_preferred(int major, int minor, int create,
 			       char *prefer);
@@ -863,7 +875,7 @@ extern struct superswitch {
 	 * metadata.
 	 */
 	int (*init_super)(struct supertype *st, mdu_array_info_t *info,
-			  unsigned long long size, char *name,
+			  struct shape *s, char *name,
 			  char *homehost, int *uuid,
 			  unsigned long long data_offset);
 
@@ -961,7 +973,7 @@ extern struct superswitch {
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *subdev, unsigned long long *freesize,
-				 int verbose);
+				 int consistency_policy, int verbose);
 
 	/* Return a linked list of 'mdinfo' structures for all arrays
 	 * in the container.  For non-containers, it is like
@@ -1059,6 +1071,9 @@ extern struct superswitch {
 	/* validate container after assemble */
 	int (*validate_container)(struct mdinfo *info);
 
+	/* write initial empty PPL on device */
+	int (*write_init_ppl)(struct supertype *st, struct mdinfo *info, int fd);
+
 	/* records new bad block in metadata */
 	int (*record_bad_block)(struct active_array *a, int n,
 					unsigned long long sector, int length);
diff --git a/super-ddf.c b/super-ddf.c
index 1707ad1e..cdd16a47 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -2290,7 +2290,7 @@ static unsigned int find_vde_by_guid(const struct ddf_super *ddf,
 
 static int init_super_ddf(struct supertype *st,
 			  mdu_array_info_t *info,
-			  unsigned long long size, char *name, char *homehost,
+			  struct shape *s, char *name, char *homehost,
 			  int *uuid, unsigned long long data_offset)
 {
 	/* This is primarily called by Create when creating a new array.
@@ -2328,7 +2328,7 @@ static int init_super_ddf(struct supertype *st,
 	struct virtual_disk *vd;
 
 	if (st->sb)
-		return init_super_ddf_bvd(st, info, size, name, homehost, uuid,
+		return init_super_ddf_bvd(st, info, s->size, name, homehost, uuid,
 					  data_offset);
 
 	if (posix_memalign((void**)&ddf, 512, sizeof(*ddf)) != 0) {
@@ -3347,7 +3347,7 @@ static int validate_geometry_ddf(struct supertype *st,
 				 int *chunk, unsigned long long size,
 				 unsigned long long data_offset,
 				 char *dev, unsigned long long *freesize,
-				 int verbose)
+				 int consistency_policy, int verbose)
 {
 	int fd;
 	struct mdinfo *sra;
diff --git a/super-gpt.c b/super-gpt.c
index 8b080a05..bb38a975 100644
--- a/super-gpt.c
+++ b/super-gpt.c
@@ -205,7 +205,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int consistency_policy, int verbose)
 {
 	pr_err("gpt metadata cannot be used this way\n");
 	return 0;
diff --git a/super-intel.c b/super-intel.c
index d5e95179..6d16a191 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -5154,7 +5154,7 @@ static int check_name(struct intel_super *super, char *name, int quiet)
 }
 
 static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
-				  unsigned long long size, char *name,
+				  struct shape *s, char *name,
 				  char *homehost, int *uuid,
 				  long long data_offset)
 {
@@ -5249,7 +5249,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	strncpy((char *) dev->volume, name, MAX_RAID_SERIAL_LEN);
 	array_blocks = calc_array_size(info->level, info->raid_disks,
 					       info->layout, info->chunk_size,
-					       size * 2);
+					       s->size * 2);
 	/* round array size down to closest MB */
 	array_blocks = (array_blocks >> SECT_PER_MB_SHIFT) << SECT_PER_MB_SHIFT;
 
@@ -5263,7 +5263,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 	vol->curr_migr_unit = 0;
 	map = get_imsm_map(dev, MAP_0);
 	set_pba_of_lba0(map, super->create_offset);
-	set_blocks_per_member(map, info_to_blocks_per_member(info, size));
+	set_blocks_per_member(map, info_to_blocks_per_member(info, s->size));
 	map->blocks_per_strip = __cpu_to_le16(info_to_blocks_per_strip(info));
 	map->failed_disk_num = ~0;
 	if (info->level > 0)
@@ -5291,7 +5291,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 		map->num_domains = 1;
 
 	/* info->size is only int so use the 'size' parameter instead */
-	num_data_stripes = (size * 2) / info_to_blocks_per_strip(info);
+	num_data_stripes = (s->size * 2) / info_to_blocks_per_strip(info);
 	num_data_stripes /= map->num_domains;
 	set_num_data_stripes(map, num_data_stripes);
 
@@ -5313,7 +5313,7 @@ static int init_super_imsm_volume(struct supertype *st, mdu_array_info_t *info,
 }
 
 static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
-			   unsigned long long size, char *name,
+		           struct shape *s, char *name,
 			   char *homehost, int *uuid,
 			   unsigned long long data_offset)
 {
@@ -5336,7 +5336,7 @@ static int init_super_imsm(struct supertype *st, mdu_array_info_t *info,
 	}
 
 	if (st->sb)
-		return init_super_imsm_volume(st, info, size, name, homehost, uuid,
+		return init_super_imsm_volume(st, info, s, name, homehost, uuid,
 					      data_offset);
 
 	if (info)
@@ -6913,7 +6913,7 @@ static int validate_geometry_imsm(struct supertype *st, int level, int layout,
 				  int raiddisks, int *chunk, unsigned long long size,
 				  unsigned long long data_offset,
 				  char *dev, unsigned long long *freesize,
-				  int verbose)
+				  int consistency_policy, int verbose)
 {
 	int fd, cfd;
 	struct mdinfo *sra;
@@ -10950,7 +10950,7 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
 				    geo->raid_disks + devNumChange,
 				    &chunk,
 				    geo->size, INVALID_SECTORS,
-				    0, 0, 1))
+				    0, 0, info.consistency_policy, 1))
 		change = -1;
 
 	if (check_devs) {
diff --git a/super-mbr.c b/super-mbr.c
index f5e4ceab..1bbe57a9 100644
--- a/super-mbr.c
+++ b/super-mbr.c
@@ -193,7 +193,7 @@ static int validate_geometry(struct supertype *st, int level,
 			     int *chunk, unsigned long long size,
 			     unsigned long long data_offset,
 			     char *subdev, unsigned long long *freesize,
-			     int verbose)
+			     int consistency_policy, int verbose)
 {
 	pr_err("mbr metadata cannot be used this way\n");
 	return 0;
diff --git a/super0.c b/super0.c
index 938cfd95..0fec96b7 100644
--- a/super0.c
+++ b/super0.c
@@ -725,7 +725,7 @@ static int update_super0(struct supertype *st, struct mdinfo *info,
  * We use the first 8 bytes (64bits) of the sha1 of the host name
  */
 static int init_super0(struct supertype *st, mdu_array_info_t *info,
-		       unsigned long long size, char *ignored_name,
+		       struct shape *s, char *ignored_name,
 		       char *homehost, int *uuid,
 		       unsigned long long data_offset)
 {
@@ -764,8 +764,8 @@ static int init_super0(struct supertype *st, mdu_array_info_t *info,
 	sb->gvalid_words = 0; /* ignored */
 	sb->ctime = time(0);
 	sb->level = info->level;
-	sb->size = size;
-	if (size != (unsigned long long)sb->size)
+	sb->size = s->size;
+	if (s->size != (unsigned long long)sb->size)
 		return 0;
 	sb->nr_disks = info->nr_disks;
 	sb->raid_disks = info->raid_disks;
@@ -1267,7 +1267,7 @@ static int validate_geometry0(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int consistency_policy, int verbose)
 {
 	unsigned long long ldsize;
 	int fd;
diff --git a/super1.c b/super1.c
index 882cd61a..fa238329 100644
--- a/super1.c
+++ b/super1.c
@@ -1397,7 +1397,7 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 }
 
 static int init_super1(struct supertype *st, mdu_array_info_t *info,
-		       unsigned long long size, char *name, char *homehost,
+		       struct shape *s, char *name, char *homehost,
 		       int *uuid, unsigned long long data_offset)
 {
 	struct mdp_superblock_1 *sb;
@@ -1450,7 +1450,7 @@ static int init_super1(struct supertype *st, mdu_array_info_t *info,
 	sb->ctime = __cpu_to_le64((unsigned long long)time(0));
 	sb->level = __cpu_to_le32(info->level);
 	sb->layout = __cpu_to_le32(info->layout);
-	sb->size = __cpu_to_le64(size*2ULL);
+	sb->size = __cpu_to_le64(s->size*2ULL);
 	sb->chunksize = __cpu_to_le32(info->chunk_size>>9);
 	sb->raid_disks = __cpu_to_le32(info->raid_disks);
 
@@ -2485,7 +2485,7 @@ static int validate_geometry1(struct supertype *st, int level,
 			      int *chunk, unsigned long long size,
 			      unsigned long long data_offset,
 			      char *subdev, unsigned long long *freesize,
-			      int verbose)
+			      int consistency_policy, int verbose)
 {
 	unsigned long long ldsize, devsize;
 	int bmspace;
diff --git a/sysfs.c b/sysfs.c
index b0657a04..53589a76 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -242,6 +242,17 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 	} else
 		sra->sysfs_array_state[0] = 0;
 
+	if (options & GET_CONSISTENCY_POLICY) {
+		strcpy(base, "consistency_policy");
+		if (load_sys(fname, buf, sizeof(buf))) {
+			sra->consistency_policy = CONSISTENCY_POLICY_UNKNOWN;
+		} else {
+			sra->consistency_policy = map_name(consistency_policies, buf);
+			if (sra->consistency_policy == UnSet)
+				sra->consistency_policy = CONSISTENCY_POLICY_UNKNOWN;
+		}
+	}
+
 	if (! (options & GET_DEVS))
 		return sra;
 
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 0/6] mdadm support for Partial Parity Log
From: Artur Paszkiewicz @ 2017-03-16 21:09 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Artur Paszkiewicz

This is the mdadm part of the PPL functionality. It adds a new parameter to
mdadm to allow selecting the consistency policy for an array and implements PPL
as one of the policies.

All of this is currently targeted and tested with imsm and super1 metadata
arrays. The kernel patches are available on Shaohua's md tree on the for-next
branch.

v3:
- Rebased to latest mdadm code.
- Replaced --rwh-policy with --consistency-policy.
- Adjusted to the kernel interface changes.
- Write the initial ppl header in mdadm when creating an array.
- Validate ppl when starting an imsm array. 
- Support for --update.
- Support also 1.0 metadata.
- Use Grow mode for enabing or disabling ppl for an active array.

v2:
- Rebased to latest mdadm code.
- Fixed runtime policy switching.
- Recognize IMSM PPL journal drive correctly.
- Removed bitfields from imsm_dev.
- Made dirty/consistent checks in imsm_set_array_state() more readable.
- Added comment about calculating sb_start.

Artur Paszkiewicz (6):
  Generic support for --consistency-policy and PPL
  Detail: show consistency policy
  imsm: PPL support
  super1: PPL support
  Add 'ppl' and 'no-ppl' options for --update=
  Grow: support consistency policy change

 Assemble.c    |  58 ++++++++++
 Create.c      |  20 +++-
 Detail.c      |  90 +++++++++------
 Grow.c        | 187 ++++++++++++++++++++++++++++++-
 Incremental.c |   3 +
 Kill.c        |   2 +-
 Makefile      |   5 +-
 ReadMe.c      |   7 +-
 maps.c        |  10 ++
 md_p.h        |  25 +++++
 mdadm.8.in    |  85 ++++++++++++--
 mdadm.c       |  64 ++++++++++-
 mdadm.h       |  30 ++++-
 super-ddf.c   |  12 +-
 super-gpt.c   |   2 +-
 super-intel.c | 347 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 super-mbr.c   |   2 +-
 super0.c      |  12 +-
 super1.c      | 214 +++++++++++++++++++++++++++++++-----
 sysfs.c       |  25 +++++
 20 files changed, 1069 insertions(+), 131 deletions(-)

-- 
2.11.0


^ permalink raw reply

* RE: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
From: Reshetova, Elena @ 2017-03-16 18:00 UTC (permalink / raw)
  To: James Bottomley, Michael Ellerman,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-raid-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	fcoe-devel-s9riP+hp16TNLxjTenLetw@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b@public.gmane.org,
	target-devel-u79uwXL29TZNg+MwTxZMZA
In-Reply-To: <1489503539.3214.17.camel-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org>

> On Tue, 2017-03-14 at 12:29 +0000, Reshetova, Elena wrote:
> > > Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> writes:
> > >
> > > > refcount_t type and corresponding API should be
> > > > used instead of atomic_t when the variable is used as
> > > > a reference counter. This allows to avoid accidental
> > > > refcounter overflows that might lead to use-after-free
> > > > situations.
> > > >
> > > > Signed-off-by: Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> > > > Signed-off-by: Hans Liljestrand <ishkamiel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > > Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> > > > Signed-off-by: David Windsor <dwindsor-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > > > ---
> > > >  drivers/md/md.c | 6 +++---
> > > >  drivers/md/md.h | 3 ++-
> > > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > >
> > > When booting linux-next (specifically 5be4921c9958ec) I'm seeing
> > > the
> > > backtrace below. I suspect this patch is just exposing an existing
> > > issue?
> >
> > Yes, we have actually been following this issue in the another
> > thread.
> > It looks like the object is re-used somehow, but I can't quite
> > understand how just by reading the code.
> > This was what I put into the previous thread:
> >
> > "The log below indicates that you are using your refcounter in a bit
> > weird way in mddev_find().
> > However, I can't find the place (just by reading the code) where you
> > would increment refcounter from zero (vs. setting it to one).
> > It looks like you either iterate over existing nodes (and increment
> > their counters, which should be >= 1 at the time of increment) or
> > create a new node, but then mddev_init() sets the counter to 1. "
> >
> > If you can help to understand what is going on with the object
> > creation/destruction, would be appreciated!
> >
> > Also Shaohua Li stopped this patch coming from his tree since the
> > issue was caught at that time, so we are not going to merge this
> > until we figure it out.
> 
> Asking on the correct list (dm-devel) would have got you the easy
> answer:  The refcount behind mddev->active is a genuine atomic.  It has
> refcount properties but only if the array fails to initialise (in that
> case, final put kills it).  Once it's added to the system as a gendisk,
> it cannot be freed until md_free().  Thus its ->active count can go to
> zero (when it becomes inactive; usually because of an unmount). On a
> simple allocation regardless of outcome, the last executed statement in
> md_alloc is mddev_put(): that destroys the device if we didn't manage
> to create it or returns 0 and adds an inactive device to the system
> which the user can get with mddev_find().

Thank you James for explaining this! I guess in this case, the conversion doesn't make sense. 
And sorry about not asking in a correct place: we are handling many similar patches now and while I try to reach the right audience using get_maintainer script, it doesn't always succeeds. 

Best Regards,
Elena.

> 
> James
> 

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* RE: linux-next: WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 refcount_inc+0x37/0x40
From: Reshetova, Elena @ 2017-03-16 18:00 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Andrei Vagin, linux-raid@vger.kernel.org
In-Reply-To: <20170314163155.v4ppkwcfcmfwwdu7@kernel.org>


> On Mon, Mar 13, 2017 at 10:04:32AM +0000, Reshetova, Elena wrote:
> > > On Fri, Mar 10, 2017 at 12:01:06PM -0800, Andrei Vagin wrote:
> > > > Hello,
> > > >
> > > > We run CRIU tests for linux-next kernels and here is a new issue:
> > > >
> > > > All logs are here: https://api.travis-
> ci.org/jobs/209680974/log.txt?deansi=true
> > > > The kernel version is 4.11.0-rc1-next-20170310
> > >
> > > Thanks for the reporting. It caused by 731d126(drivers, md: convert
> > > mddev.active from atomic_t to refcount_t). It turns out the count doesn't
> match
> > > the refcount usage. I'll drop the patch temporarily.
> >
> > The log below indicates that you are using your refcounter in a bit weird way in
> mddev_find().
> > However, I can't find the place (just by reading the code) where you would
> increment refcounter from zero (vs. setting it to one).
> > It looks like you either iterate over existing nodes (and increment their counters,
> which should be >= 1 at the time of increment) or create a new node, but then
> mddev_init() sets the counter to 1.
> >
> > Do you somehow reuse the objects or?
> 
> Yes, we reuse the objects, so they are not typical refcounter. The other patch
> for stripe->count probably has the same issue, as we will reuse the stripe even
> its count equals to 0, I guess that doesn't fit into refcount too.

I guess the only option for conversion in this case is to do global +1 on the whole refcounting scheme. 
We have done such changes in past to similar places. Do you think it would make sense for these patches? 
I can give it a try.

Best Regards,
Elena.

^ permalink raw reply

* [PATCH] dm ioctl: Remove double parentheses
From: Matthias Kaehlcke @ 2017-03-16 16:48 UTC (permalink / raw)
  To: Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Shaohua Li, linux-raid, linux-kernel, Grant Grundler,
	Michael Davidson, Greg Hackmann, Matthias Kaehlcke

The extra pair of parantheses is not needed and causes clang to generate
the following warning:

drivers/md/dm-ioctl.c:1776:11: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
        if ((cmd == DM_DEV_CREATE_CMD)) {
             ~~~~^~~~~~~~~~~~~~~~~~~~
drivers/md/dm-ioctl.c:1776:11: note: remove extraneous parentheses around the comparison to silence this warning
        if ((cmd == DM_DEV_CREATE_CMD)) {
            ~    ^                   ~
drivers/md/dm-ioctl.c:1776:11: note: use '=' to turn this equality comparison into an assignment
        if ((cmd == DM_DEV_CREATE_CMD)) {
                 ^~
                 =

Also remove another double parentheses that don't cause a warning.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
---
 drivers/md/dm-ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index a5a9b17f0f7f..9360036bf6d3 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1777,12 +1777,12 @@ static int validate_params(uint cmd, struct dm_ioctl *param)
 	    cmd == DM_LIST_VERSIONS_CMD)
 		return 0;
 
-	if ((cmd == DM_DEV_CREATE_CMD)) {
+	if (cmd == DM_DEV_CREATE_CMD) {
 		if (!*param->name) {
 			DMWARN("name not supplied when creating device");
 			return -EINVAL;
 		}
-	} else if ((*param->uuid && *param->name)) {
+	} else if (*param->uuid && *param->name) {
 		DMWARN("only supply one of name or uuid, cmd(%u)", cmd);
 		return -EINVAL;
 	}
-- 
2.12.0.367.g23dc2f6d3c-goog

^ permalink raw reply related

* [GIT PULL] MD update for 4.11-rc3
From: Shaohua Li @ 2017-03-16 16:40 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, linux-raid, neilb

Hi,
Please pull MD fix for 4.11-rc3. This update includes some fixes and cleanups:
- Fix a parity calculation bug of raid5 cache by Song
- Fix a potential deadlock issue by me
- Fix two endian issues by Jason
- Fix a disk limitation issue by Neil
- Other small fixes and cleanup

Thanks,
Shaohua

The following changes since commit ea6200e84182989a3cce9687cf79a23ac44ec4db:

  Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2017-03-08 14:45:31 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git for-linus

for you to fetch changes up to 11353b9d10392e79e32603d2178e75feb25eaf0d:

  md/raid1: fix a trivial typo in comments (2017-03-14 11:10:44 -0700)

----------------------------------------------------------------
Guoqing Jiang (3):
      md-cluster: free md_cluster_info if node leave cluster
      md-cluster: remove useless memset from gather_all_resync_info
      md: move funcs from pers->resize to update_size

Jason Yan (2):
      md: fix super_offset endianness in super_1_rdev_size_change
      md: fix incorrect use of lexx_to_cpu in does_sb_need_changing

NeilBrown (1):
      md: don't impose the MD_SB_DISKS limit on arrays without metadata.

Shaohua Li (3):
      md/raid10: submit bio directly to replacement disk
      md: delete dead code
      md/raid1/10: fix potential deadlock

Song Liu (1):
      md/r5cache: fix set_syndrome_sources() for data in cache

Zhilong Liu (1):
      md/raid1: fix a trivial typo in comments

 drivers/md/md-cluster.c |  2 +-
 drivers/md/md.c         | 27 +++++++++++----------------
 drivers/md/md.h         |  6 ------
 drivers/md/raid1.c      | 29 ++++++++++++++++++++++++-----
 drivers/md/raid10.c     | 41 ++++++++++++++++++++++++++++++++++-------
 drivers/md/raid5.c      |  5 ++---
 6 files changed, 72 insertions(+), 38 deletions(-)

^ permalink raw reply

* [PATCH v3 14/14] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

All reshape I/O share pages from 1st copy device, so just use that pages
for avoiding direct access to bvec table in handle_reshape_read_error.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid10.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 8db3b0869be4..67cf09c0c649 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4681,7 +4681,10 @@ static int handle_reshape_read_error(struct mddev *mddev,
 	struct r10bio *r10b = &on_stack.r10_bio;
 	int slot = 0;
 	int idx = 0;
-	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
+	struct page **pages;
+
+	/* reshape IOs share pages from .devs[0].bio */
+	pages = get_resync_pages(r10_bio->devs[0].bio)->pages;
 
 	r10b->sector = r10_bio->sector;
 	__raid10_find_phys(&conf->prev, r10b);
@@ -4710,7 +4713,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 			success = sync_page_io(rdev,
 					       addr,
 					       s << 9,
-					       bvec[idx].bv_page,
+					       pages[idx],
 					       REQ_OP_READ, 0, false);
 			rdev_dec_pending(rdev, mddev);
 			rcu_read_lock();
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3 13/14] md: raid10: retrieve page from preallocated resync page array
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid10.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d2cb68971486..8db3b0869be4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2075,6 +2075,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 	int i, first;
 	struct bio *tbio, *fbio;
 	int vcnt;
+	struct page **tpages, **fpages;
 
 	atomic_set(&r10_bio->remaining, 1);
 
@@ -2090,6 +2091,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 	fbio = r10_bio->devs[i].bio;
 	fbio->bi_iter.bi_size = r10_bio->sectors << 9;
 	fbio->bi_iter.bi_idx = 0;
+	fpages = get_resync_pages(fbio)->pages;
 
 	vcnt = (r10_bio->sectors + (PAGE_SIZE >> 9) - 1) >> (PAGE_SHIFT - 9);
 	/* now find blocks with errors */
@@ -2104,6 +2106,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			continue;
 		if (i == first)
 			continue;
+
+		tpages = get_resync_pages(tbio)->pages;
 		d = r10_bio->devs[i].devnum;
 		rdev = conf->mirrors[d].rdev;
 		if (!r10_bio->devs[i].bio->bi_error) {
@@ -2116,8 +2120,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 				int len = PAGE_SIZE;
 				if (sectors < (len / 512))
 					len = sectors * 512;
-				if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
-					   page_address(tbio->bi_io_vec[j].bv_page),
+				if (memcmp(page_address(fpages[j]),
+					   page_address(tpages[j]),
 					   len))
 					break;
 				sectors -= len/512;
@@ -2215,6 +2219,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
 	int idx = 0;
 	int dr = r10_bio->devs[0].devnum;
 	int dw = r10_bio->devs[1].devnum;
+	struct page **pages = get_resync_pages(bio)->pages;
 
 	while (sectors) {
 		int s = sectors;
@@ -2230,7 +2235,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
 		ok = sync_page_io(rdev,
 				  addr,
 				  s << 9,
-				  bio->bi_io_vec[idx].bv_page,
+				  pages[idx],
 				  REQ_OP_READ, 0, false);
 		if (ok) {
 			rdev = conf->mirrors[dw].rdev;
@@ -2238,7 +2243,7 @@ static void fix_recovery_read_error(struct r10bio *r10_bio)
 			ok = sync_page_io(rdev,
 					  addr,
 					  s << 9,
-					  bio->bi_io_vec[idx].bv_page,
+					  pages[idx],
 					  REQ_OP_WRITE, 0, false);
 			if (!ok) {
 				set_bit(WriteErrorSeen, &rdev->flags);
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3 12/14] md: raid10: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.

The introduced cost is that we need to allocate (128 + 16) * copies
bytes per r10_bio, and it is fine because the inflight r10_bio for
resync shouldn't be much, as pointed by Shaohua.

Also bio_reset() in raid10_sync_request() and reshape_request()
are removed because all bios are freshly new now in these functions
and not necessary to reset any more.

This patch can be thought as cleanup too.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid10.c | 134 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 82 insertions(+), 52 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 98e53f39b66e..d2cb68971486 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -110,6 +110,24 @@ static void end_reshape(struct r10conf *conf);
 #define raid10_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid10 " fmt, ##args); } while (0)
 
+/*
+ * 'strct resync_pages' stores actual pages used for doing the resync
+ *  IO, and it is per-bio, so make .bi_private points to it.
+ */
+static inline struct resync_pages *get_resync_pages(struct bio *bio)
+{
+	return bio->bi_private;
+}
+
+/*
+ * for resync bio, r10bio pointer can be retrieved from the per-bio
+ * 'struct resync_pages'.
+ */
+static inline struct r10bio *get_resync_r10bio(struct bio *bio)
+{
+	return get_resync_pages(bio)->raid_bio;
+}
+
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct r10conf *conf = data;
@@ -140,11 +158,11 @@ static void r10bio_pool_free(void *r10_bio, void *data)
 static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
 	struct r10conf *conf = data;
-	struct page *page;
 	struct r10bio *r10_bio;
 	struct bio *bio;
-	int i, j;
-	int nalloc;
+	int j;
+	int nalloc, nalloc_rp;
+	struct resync_pages *rps;
 
 	r10_bio = r10bio_pool_alloc(gfp_flags, conf);
 	if (!r10_bio)
@@ -156,6 +174,15 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	else
 		nalloc = 2; /* recovery */
 
+	/* allocate once for all bios */
+	if (!conf->have_replacement)
+		nalloc_rp = nalloc;
+	else
+		nalloc_rp = nalloc * 2;
+	rps = kmalloc(sizeof(struct resync_pages) * nalloc_rp, gfp_flags);
+	if (!rps)
+		goto out_free_r10bio;
+
 	/*
 	 * Allocate bios.
 	 */
@@ -175,36 +202,40 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	 * Allocate RESYNC_PAGES data pages and attach them
 	 * where needed.
 	 */
-	for (j = 0 ; j < nalloc; j++) {
+	for (j = 0; j < nalloc; j++) {
 		struct bio *rbio = r10_bio->devs[j].repl_bio;
+		struct resync_pages *rp, *rp_repl;
+
+		rp = &rps[j];
+		if (rbio)
+			rp_repl = &rps[nalloc + j];
+
 		bio = r10_bio->devs[j].bio;
-		for (i = 0; i < RESYNC_PAGES; i++) {
-			if (j > 0 && !test_bit(MD_RECOVERY_SYNC,
-					       &conf->mddev->recovery)) {
-				/* we can share bv_page's during recovery
-				 * and reshape */
-				struct bio *rbio = r10_bio->devs[0].bio;
-				page = rbio->bi_io_vec[i].bv_page;
-				get_page(page);
-			} else
-				page = alloc_page(gfp_flags);
-			if (unlikely(!page))
+
+		if (!j || test_bit(MD_RECOVERY_SYNC,
+				   &conf->mddev->recovery)) {
+			if (resync_alloc_pages(rp, gfp_flags))
 				goto out_free_pages;
+		} else {
+			memcpy(rp, &rps[0], sizeof(*rp));
+			resync_get_all_pages(rp);
+		}
 
-			bio->bi_io_vec[i].bv_page = page;
-			if (rbio)
-				rbio->bi_io_vec[i].bv_page = page;
+		rp->idx = 0;
+		rp->raid_bio = r10_bio;
+		bio->bi_private = rp;
+		if (rbio) {
+			memcpy(rp_repl, rp, sizeof(*rp));
+			rbio->bi_private = rp_repl;
 		}
 	}
 
 	return r10_bio;
 
 out_free_pages:
-	for ( ; i > 0 ; i--)
-		safe_put_page(bio->bi_io_vec[i-1].bv_page);
-	while (j--)
-		for (i = 0; i < RESYNC_PAGES ; i++)
-			safe_put_page(r10_bio->devs[j].bio->bi_io_vec[i].bv_page);
+	while (--j >= 0)
+		resync_free_pages(&rps[j * 2]);
+
 	j = 0;
 out_free_bio:
 	for ( ; j < nalloc; j++) {
@@ -213,30 +244,34 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 		if (r10_bio->devs[j].repl_bio)
 			bio_put(r10_bio->devs[j].repl_bio);
 	}
+	kfree(rps);
+out_free_r10bio:
 	r10bio_pool_free(r10_bio, conf);
 	return NULL;
 }
 
 static void r10buf_pool_free(void *__r10_bio, void *data)
 {
-	int i;
 	struct r10conf *conf = data;
 	struct r10bio *r10bio = __r10_bio;
 	int j;
+	struct resync_pages *rp = NULL;
 
-	for (j=0; j < conf->copies; j++) {
+	for (j = conf->copies; j--; ) {
 		struct bio *bio = r10bio->devs[j].bio;
-		if (bio) {
-			for (i = 0; i < RESYNC_PAGES; i++) {
-				safe_put_page(bio->bi_io_vec[i].bv_page);
-				bio->bi_io_vec[i].bv_page = NULL;
-			}
-			bio_put(bio);
-		}
+
+		rp = get_resync_pages(bio);
+		resync_free_pages(rp);
+		bio_put(bio);
+
 		bio = r10bio->devs[j].repl_bio;
 		if (bio)
 			bio_put(bio);
 	}
+
+	/* resync pages array stored in the 1st bio's .bi_private */
+	kfree(rp);
+
 	r10bio_pool_free(r10bio, conf);
 }
 
@@ -1937,7 +1972,7 @@ static void __end_sync_read(struct r10bio *r10_bio, struct bio *bio, int d)
 
 static void end_sync_read(struct bio *bio)
 {
-	struct r10bio *r10_bio = bio->bi_private;
+	struct r10bio *r10_bio = get_resync_r10bio(bio);
 	struct r10conf *conf = r10_bio->mddev->private;
 	int d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);
 
@@ -1946,6 +1981,7 @@ static void end_sync_read(struct bio *bio)
 
 static void end_reshape_read(struct bio *bio)
 {
+	/* reshape read bio isn't allocated from r10buf_pool */
 	struct r10bio *r10_bio = bio->bi_private;
 
 	__end_sync_read(r10_bio, bio, r10_bio->read_slot);
@@ -1980,7 +2016,7 @@ static void end_sync_request(struct r10bio *r10_bio)
 
 static void end_sync_write(struct bio *bio)
 {
-	struct r10bio *r10_bio = bio->bi_private;
+	struct r10bio *r10_bio = get_resync_r10bio(bio);
 	struct mddev *mddev = r10_bio->mddev;
 	struct r10conf *conf = mddev->private;
 	int d;
@@ -2060,6 +2096,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 	for (i=0 ; i < conf->copies ; i++) {
 		int  j, d;
 		struct md_rdev *rdev;
+		struct resync_pages *rp;
 
 		tbio = r10_bio->devs[i].bio;
 
@@ -2101,11 +2138,13 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 		 * First we need to fixup bv_offset, bv_len and
 		 * bi_vecs, as the read request might have corrupted these
 		 */
+		rp = get_resync_pages(tbio);
 		bio_reset(tbio);
 
 		tbio->bi_vcnt = vcnt;
 		tbio->bi_iter.bi_size = fbio->bi_iter.bi_size;
-		tbio->bi_private = r10_bio;
+		rp->raid_bio = r10_bio;
+		tbio->bi_private = rp;
 		tbio->bi_iter.bi_sector = r10_bio->devs[i].addr;
 		tbio->bi_end_io = end_sync_write;
 		bio_set_op_attrs(tbio, REQ_OP_WRITE, 0);
@@ -3173,10 +3212,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 					}
 				}
 				bio = r10_bio->devs[0].bio;
-				bio_reset(bio);
 				bio->bi_next = biolist;
 				biolist = bio;
-				bio->bi_private = r10_bio;
 				bio->bi_end_io = end_sync_read;
 				bio_set_op_attrs(bio, REQ_OP_READ, 0);
 				if (test_bit(FailFast, &rdev->flags))
@@ -3200,10 +3237,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 
 				if (!test_bit(In_sync, &mrdev->flags)) {
 					bio = r10_bio->devs[1].bio;
-					bio_reset(bio);
 					bio->bi_next = biolist;
 					biolist = bio;
-					bio->bi_private = r10_bio;
 					bio->bi_end_io = end_sync_write;
 					bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 					bio->bi_iter.bi_sector = to_addr
@@ -3228,10 +3263,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				if (mreplace == NULL || bio == NULL ||
 				    test_bit(Faulty, &mreplace->flags))
 					break;
-				bio_reset(bio);
 				bio->bi_next = biolist;
 				biolist = bio;
-				bio->bi_private = r10_bio;
 				bio->bi_end_io = end_sync_write;
 				bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 				bio->bi_iter.bi_sector = to_addr +
@@ -3353,7 +3386,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				r10_bio->devs[i].repl_bio->bi_end_io = NULL;
 
 			bio = r10_bio->devs[i].bio;
-			bio_reset(bio);
 			bio->bi_error = -EIO;
 			rcu_read_lock();
 			rdev = rcu_dereference(conf->mirrors[d].rdev);
@@ -3378,7 +3410,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			atomic_inc(&r10_bio->remaining);
 			bio->bi_next = biolist;
 			biolist = bio;
-			bio->bi_private = r10_bio;
 			bio->bi_end_io = end_sync_read;
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
 			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
@@ -3397,13 +3428,11 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 
 			/* Need to set up for writing to the replacement */
 			bio = r10_bio->devs[i].repl_bio;
-			bio_reset(bio);
 			bio->bi_error = -EIO;
 
 			sector = r10_bio->devs[i].addr;
 			bio->bi_next = biolist;
 			biolist = bio;
-			bio->bi_private = r10_bio;
 			bio->bi_end_io = end_sync_write;
 			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
 			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
@@ -3442,7 +3471,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		if (len == 0)
 			break;
 		for (bio= biolist ; bio ; bio=bio->bi_next) {
-			page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+			struct resync_pages *rp = get_resync_pages(bio);
+			page = resync_fetch_page(rp, rp->idx++);
 			/*
 			 * won't fail because the vec table is big enough
 			 * to hold all these pages
@@ -3451,7 +3481,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		}
 		nr_sectors += len>>9;
 		sector_nr += len>>9;
-	} while (biolist->bi_vcnt < RESYNC_PAGES);
+	} while (get_resync_pages(biolist)->idx < RESYNC_PAGES);
 	r10_bio->sectors = nr_sectors;
 
 	while (biolist) {
@@ -3459,7 +3489,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 		biolist = biolist->bi_next;
 
 		bio->bi_next = NULL;
-		r10_bio = bio->bi_private;
+		r10_bio = get_resync_r10bio(bio);
 		r10_bio->sectors = nr_sectors;
 
 		if (bio->bi_end_io == end_sync_read) {
@@ -4354,6 +4384,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 	struct bio *blist;
 	struct bio *bio, *read_bio;
 	int sectors_done = 0;
+	struct page **pages;
 
 	if (sector_nr == 0) {
 		/* If restarting in the middle, skip the initial sectors */
@@ -4504,11 +4535,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 		if (!rdev2 || test_bit(Faulty, &rdev2->flags))
 			continue;
 
-		bio_reset(b);
 		b->bi_bdev = rdev2->bdev;
 		b->bi_iter.bi_sector = r10_bio->devs[s/2].addr +
 			rdev2->new_data_offset;
-		b->bi_private = r10_bio;
 		b->bi_end_io = end_reshape_write;
 		bio_set_op_attrs(b, REQ_OP_WRITE, 0);
 		b->bi_next = blist;
@@ -4518,8 +4547,9 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 	/* Now add as many pages as possible to all of these bios. */
 
 	nr_sectors = 0;
+	pages = get_resync_pages(r10_bio->devs[0].bio)->pages;
 	for (s = 0 ; s < max_sectors; s += PAGE_SIZE >> 9) {
-		struct page *page = r10_bio->devs[0].bio->bi_io_vec[s/(PAGE_SIZE>>9)].bv_page;
+		struct page *page = pages[s / (PAGE_SIZE >> 9)];
 		int len = (max_sectors - s) << 9;
 		if (len > PAGE_SIZE)
 			len = PAGE_SIZE;
@@ -4703,7 +4733,7 @@ static int handle_reshape_read_error(struct mddev *mddev,
 
 static void end_reshape_write(struct bio *bio)
 {
-	struct r10bio *r10_bio = bio->bi_private;
+	struct r10bio *r10_bio = get_resync_r10bio(bio);
 	struct mddev *mddev = r10_bio->mddev;
 	struct r10conf *conf = mddev->private;
 	int d;
-- 
2.9.3

^ permalink raw reply related

* [PATCH v3 11/14] md: raid10: refactor code of read reshape's .bi_end_io
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

reshape read request is a bit special and requires one extra
bio which isn't allocated from r10buf_pool.

Refactor the .bi_end_io for read reshape, so that we can use
raid10's resync page mangement approach easily in the following
patches.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid10.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2b40420299e3..98e53f39b66e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1909,17 +1909,9 @@ static int raid10_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	return err;
 }
 
-static void end_sync_read(struct bio *bio)
+static void __end_sync_read(struct r10bio *r10_bio, struct bio *bio, int d)
 {
-	struct r10bio *r10_bio = bio->bi_private;
 	struct r10conf *conf = r10_bio->mddev->private;
-	int d;
-
-	if (bio == r10_bio->master_bio) {
-		/* this is a reshape read */
-		d = r10_bio->read_slot; /* really the read dev */
-	} else
-		d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);
 
 	if (!bio->bi_error)
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
@@ -1943,6 +1935,22 @@ static void end_sync_read(struct bio *bio)
 	}
 }
 
+static void end_sync_read(struct bio *bio)
+{
+	struct r10bio *r10_bio = bio->bi_private;
+	struct r10conf *conf = r10_bio->mddev->private;
+	int d = find_bio_disk(conf, r10_bio, bio, NULL, NULL);
+
+	__end_sync_read(r10_bio, bio, d);
+}
+
+static void end_reshape_read(struct bio *bio)
+{
+	struct r10bio *r10_bio = bio->bi_private;
+
+	__end_sync_read(r10_bio, bio, r10_bio->read_slot);
+}
+
 static void end_sync_request(struct r10bio *r10_bio)
 {
 	struct mddev *mddev = r10_bio->mddev;
@@ -4466,7 +4474,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 	read_bio->bi_iter.bi_sector = (r10_bio->devs[r10_bio->read_slot].addr
 			       + rdev->data_offset);
 	read_bio->bi_private = r10_bio;
-	read_bio->bi_end_io = end_sync_read;
+	read_bio->bi_end_io = end_reshape_read;
 	bio_set_op_attrs(read_bio, REQ_OP_READ, 0);
 	read_bio->bi_flags &= (~0UL << BIO_RESET_BITS);
 	read_bio->bi_error = 0;
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3 10/14] md: raid1: improve write behind
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

This patch improve handling of write behind in the following ways:

- introduce behind master bio to hold all write behind pages
- fast clone bios from behind master bio
- avoid to change bvec table directly
- use bio_copy_data() and make code more clean

Suggested-by: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid1.c | 118 ++++++++++++++++++++++++-----------------------------
 drivers/md/raid1.h |  10 +++--
 2 files changed, 61 insertions(+), 67 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 2f3622c695ce..3c13286190c1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -405,12 +405,9 @@ static void close_write(struct r1bio *r1_bio)
 {
 	/* it really is the end of this request */
 	if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
-		/* free extra copy of the data pages */
-		int i = r1_bio->behind_page_count;
-		while (i--)
-			safe_put_page(r1_bio->behind_bvecs[i].bv_page);
-		kfree(r1_bio->behind_bvecs);
-		r1_bio->behind_bvecs = NULL;
+		bio_free_pages(r1_bio->behind_master_bio);
+		bio_put(r1_bio->behind_master_bio);
+		r1_bio->behind_master_bio = NULL;
 	}
 	/* clear the bitmap if all writes complete successfully */
 	bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
@@ -512,6 +509,10 @@ static void raid1_end_write_request(struct bio *bio)
 	}
 
 	if (behind) {
+		/* we release behind master bio when all write are done */
+		if (r1_bio->behind_master_bio == bio)
+			to_put = NULL;
+
 		if (test_bit(WriteMostly, &rdev->flags))
 			atomic_dec(&r1_bio->behind_remaining);
 
@@ -1096,39 +1097,46 @@ static void unfreeze_array(struct r1conf *conf)
 	wake_up(&conf->wait_barrier);
 }
 
-/* duplicate the data pages for behind I/O
- */
-static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
+static struct bio *alloc_behind_master_bio(struct r1bio *r1_bio,
+					   struct bio *bio,
+					   int offset, int size)
 {
-	int i;
-	struct bio_vec *bvec;
-	struct bio_vec *bvecs = kzalloc(bio->bi_vcnt * sizeof(struct bio_vec),
-					GFP_NOIO);
-	if (unlikely(!bvecs))
-		return;
+	unsigned vcnt = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	int i = 0;
+	struct bio *behind_bio = NULL;
+
+	behind_bio = bio_alloc_mddev(GFP_NOIO, vcnt, r1_bio->mddev);
+	if (!behind_bio)
+		goto fail;
+
+	while (i < vcnt && size) {
+		struct page *page;
+		int len = min_t(int, PAGE_SIZE, size);
+
+		page = alloc_page(GFP_NOIO);
+		if (unlikely(!page))
+			goto free_pages;
+
+		bio_add_page(behind_bio, page, len, 0);
+
+		size -= len;
+		i++;
+	}
 
-	bio_for_each_segment_all(bvec, bio, i) {
-		bvecs[i] = *bvec;
-		bvecs[i].bv_page = alloc_page(GFP_NOIO);
-		if (unlikely(!bvecs[i].bv_page))
-			goto do_sync_io;
-		memcpy(kmap(bvecs[i].bv_page) + bvec->bv_offset,
-		       kmap(bvec->bv_page) + bvec->bv_offset, bvec->bv_len);
-		kunmap(bvecs[i].bv_page);
-		kunmap(bvec->bv_page);
-	}
-	r1_bio->behind_bvecs = bvecs;
-	r1_bio->behind_page_count = bio->bi_vcnt;
+	bio_copy_data_partial(behind_bio, bio, offset,
+			      behind_bio->bi_iter.bi_size);
+
+	r1_bio->behind_master_bio = behind_bio;;
 	set_bit(R1BIO_BehindIO, &r1_bio->state);
-	return;
 
-do_sync_io:
-	for (i = 0; i < bio->bi_vcnt; i++)
-		if (bvecs[i].bv_page)
-			put_page(bvecs[i].bv_page);
-	kfree(bvecs);
+	return behind_bio;
+
+ free_pages:
 	pr_debug("%dB behind alloc failed, doing sync I/O\n",
 		 bio->bi_iter.bi_size);
+	bio_free_pages(behind_bio);
+ fail:
+	return behind_bio;
 }
 
 struct raid1_plug_cb {
@@ -1499,11 +1507,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 			    (atomic_read(&bitmap->behind_writes)
 			     < mddev->bitmap_info.max_write_behind) &&
 			    !waitqueue_active(&bitmap->behind_wait)) {
-				mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
-								mddev->bio_set,
-								offset << 9,
-								max_sectors << 9);
-				alloc_behind_pages(mbio, r1_bio);
+				mbio = alloc_behind_master_bio(r1_bio, bio,
+							       offset << 9,
+							       max_sectors << 9);
 			}
 
 			bitmap_startwrite(bitmap, r1_bio->sector,
@@ -1514,26 +1520,17 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 		}
 
 		if (!mbio) {
-			if (r1_bio->behind_bvecs)
-				mbio = bio_clone_bioset_partial(bio, GFP_NOIO,
-								mddev->bio_set,
-								offset << 9,
-								max_sectors << 9);
+			if (r1_bio->behind_master_bio)
+				mbio = bio_clone_fast(r1_bio->behind_master_bio,
+						      GFP_NOIO,
+						      mddev->bio_set);
 			else {
 				mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
 				bio_trim(mbio, offset, max_sectors);
 			}
 		}
 
-		if (r1_bio->behind_bvecs) {
-			struct bio_vec *bvec;
-			int j;
-
-			/*
-			 * We trimmed the bio, so _all is legit
-			 */
-			bio_for_each_segment_all(bvec, mbio, j)
-				bvec->bv_page = r1_bio->behind_bvecs[j].bv_page;
+		if (r1_bio->behind_master_bio) {
 			if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags))
 				atomic_inc(&r1_bio->behind_remaining);
 		}
@@ -2405,18 +2402,11 @@ static int narrow_write_error(struct r1bio *r1_bio, int i)
 		/* Write at 'sector' for 'sectors'*/
 
 		if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
-			unsigned vcnt = r1_bio->behind_page_count;
-			struct bio_vec *vec = r1_bio->behind_bvecs;
-
-			while (!vec->bv_page) {
-				vec++;
-				vcnt--;
-			}
-
-			wbio = bio_alloc_mddev(GFP_NOIO, vcnt, mddev);
-			memcpy(wbio->bi_io_vec, vec, vcnt * sizeof(struct bio_vec));
-
-			wbio->bi_vcnt = vcnt;
+			wbio = bio_clone_fast(r1_bio->behind_master_bio,
+					      GFP_NOIO,
+					      mddev->bio_set);
+			/* We really need a _all clone */
+			wbio->bi_iter = (struct bvec_iter){ 0 };
 		} else {
 			wbio = bio_clone_fast(r1_bio->master_bio, GFP_NOIO,
 					      mddev->bio_set);
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index dd22a37d0d83..4271cd7ac2de 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -153,9 +153,13 @@ struct r1bio {
 	int			read_disk;
 
 	struct list_head	retry_list;
-	/* Next two are only valid when R1BIO_BehindIO is set */
-	struct bio_vec		*behind_bvecs;
-	int			behind_page_count;
+
+	/*
+	 * When R1BIO_BehindIO is set, we store pages for write behind
+	 * in behind_master_bio.
+	 */
+	struct bio		*behind_master_bio;
+
 	/*
 	 * if the IO is in WRITE direction, then multiple bios are used.
 	 * We choose the number when they are allocated.
-- 
2.9.3


^ permalink raw reply related

* [PATCH v3 09/14] md: raid1: move 'offset' out of loop
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
	Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>

The 'offset' local variable can't be changed inside the loop, so
move it out.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid1.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4034a2963da8..2f3622c695ce 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1317,6 +1317,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
+	sector_t offset;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1481,13 +1482,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 	atomic_set(&r1_bio->behind_remaining, 0);
 
 	first_clone = 1;
+
+	offset = r1_bio->sector - bio->bi_iter.bi_sector;
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio = NULL;
-		sector_t offset;
 		if (!r1_bio->bios[i])
 			continue;
 
-		offset = r1_bio->sector - bio->bi_iter.bi_sector;
 
 		if (first_clone) {
 			/* do behind I/O ?
-- 
2.9.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox