Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* [PATCH] mdadm:fix typo in comment
From: Zhilong Liu @ 2017-03-01  8:44 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

Signed-off-by: Zhilong Liu <zlliu@suse.com>

diff --git a/mdadm.c b/mdadm.c
index b5d89e4..16fd49a 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1268,7 +1268,7 @@ int main(int argc, char *argv[])
 	 * hopefully it's mostly right but there might be some stuff
 	 * missing
 	 *
-	 * That is mosty checked in the per-mode stuff but...
+	 * That is mostly checked in the per-mode stuff but...
 	 *
 	 * For @,B,C and A without -s, the first device listed must be
 	 * an md device.  We check that here and open it.
-- 
2.6.6


^ permalink raw reply related

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Milan Broz @ 2017-03-01  9:29 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Binoy Jayan, Rajendra, Herbert Xu, Oded, Mike Snitzer,
	Linux kernel mailing list, Ondrej Mosnacek, linux-raid, dm-devel,
	Mark Brown, Arnd Bergmann, linux-crypto, Shaohua Li,
	David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <CAOtvUMf3c0fUjABmbfMrbs0W4gyhKVZyc_tZH3rQg65zF18YMg@mail.gmail.com>

On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz <gmazyland@gmail.com> wrote:
>>
>> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
>>>
>>> I was wondering if this is near to be ready for submission (apart from
>>> the testmgr.c
>>> changes) or I need to make some changes to make it similar to the IPSec offload?
>>
>> I just tried this and except it registers the IV for every new device again, it works...
>> (After a while you have many duplicate entries in /proc/crypto.)
>>
>> But I would like to see some summary why such a big patch is needed in the first place.
>> (During an internal discussions seems that people are already lost in mails and
>> patches here, so Ondra promised me to send some summary mail soon here.)
>>
>> IIRC the first initial problem was dmcrypt performance on some embedded
>> crypto processors that are not able to cope with small crypto requests effectively.
>>
>>
>> Do you have some real performance numbers that proves that such a patch is adequate?
>>
>> I would really like to see the performance issue fixed but I am really not sure
>> this approach works for everyone. It would be better to avoid repeating this exercise later.
>> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
>> to speedup things even for crypt drivers that do not support own IV generators.
>>
> 
> AFAIK the problem that we are trying to solve is that if the IV is
> generated outside the crypto API
> domain than you are forced to have an invocation of the crypto API per
> each block because you
> need to provide the IV for each block.
> 
> By putting the IV generation responsibility in the Crypto API we open
> the way to do a single invocation
> of the crypto API for a whole sequence of blocks.

Sure, but this is theory. Does it really work on some hw already?
Do you have performance measurements or comparison?

> For software implementation of XTS this doesn't matter much - but for
> hardware based XTS providers

It is not only embedded crypto, we have some more reports in the past
that 512B sectors are not ideal even for other systems.
(IIRC it was also with AES-NI that represents really big group of users).

> This lead some vendors to ship hacked up versions of dm-crypt to match
> the specific crypto hardware
> they were using, or so I've heard at least - didn't see the code myself.

I saw few version of that. There was a very hacky way to provide request-based dmcrypt
(see old "Introduce the request handling for dm-crypt" thread on dm-devel).
This is not the acceptable way but definitely it points to the same problem.

> I believe Binoy is trying to address this in a generic upstream worthy
> way instead.

IIRC the problem is performance, if we can solve it by some generic way,
good, but for now it seems to be a big change and just hope it helps later...

> Anyway, you are only supposed to see s difference when using a
> hardware based XTS provider algo
> that supports IV generation.
> 
>> I like the patch is now contained inside dmcrypt, but it still exposes IVs that
>> are designed just for old, insecure, compatibility-only containers.
>>
>> I really do not think every compatible crap must be accessible through crypto API.
>> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that this way
>> if I know it is accessible outside of dmcrypt internals...)
>> Even the ESSIV is something that was born to fix predictive IVs (CBC watermarking
>> attacks) for disk encryption only, no reason to expose it outside of disk encryption.
>>
> 
> The point is that you have more than one implementation of these
> "compatible crap" - the
> software implementation that you wrote and potentially multiple
> hardware implementations
> and putting this in the crypto API domain is the way to abstract this
> so you use the one
> that works best of your platform.

For XTS you need just simple linear IV. No problem with that, implementation
in crypto API and hw is trivial.

But for compatible IV (that provides compatibility with loopAES and very old TrueCrypt),
these should be never ever implemented again anywhere.

Specifically "tcw" is broken, insecure and provided here just to help people to migrate
from old Truecrypt containers. Even Truecrypt followers removed it from the codebase.
(It is basically combination of IV and slight modification of CBC mode. All
recent version switched to XTS and plain IV.)

So building abstraction over something known to be broken and that is now intentionally
isolated inside dmcrypt is, in my opinion, really not a good idea.

But please do get me wrong,  I do not want to block any improvement.

But it seems to me that this thread focused on creating nice crypto API interface
for FDE IVs instead of demonstration that the proposed solution really solves
the performance issue.
And not only for your hw driver, maybe other systems could benefit from the better
processing of small requests as well.

Milan

^ permalink raw reply

* [PATCH V2 1/5] md-cluster: use sync way to handle METADATA_UPDATED msg
From: Guoqing Jiang @ 2017-03-01  9:30 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
In-Reply-To: <1488357760-7893-1-git-send-email-gqjiang@suse.com>

Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
   lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
   is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
   appear if stop a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B trys to update sb (held reconfig_mutex), but stopped
	   at wait_event() in metadata_update_start since we have set
	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
	d. Then Node B receives METADATA_UPDATED msg from A, of course
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token don't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 82 +++++++++++++++++++++++++++++++++++++++----------
 drivers/md/md.c         |  4 ---
 drivers/md/md.h         |  3 --
 3 files changed, 66 insertions(+), 23 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 321ecac23027..5cf0a9d29bf0 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -67,9 +67,10 @@ struct resync_info {
  * set up all the related infos such as bitmap and personality */
 #define		MD_CLUSTER_ALREADY_IN_CLUSTER		6
 #define		MD_CLUSTER_PENDING_RECV_EVENT		7
-
+#define 	MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD		8
 
 struct md_cluster_info {
+	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
 	/* dlm lock space and resources for clustered raid. */
 	dlm_lockspace_t *lockspace;
 	int slot_number;
@@ -523,11 +524,17 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 
 static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
+	int got_lock = 0;
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
-	set_bit(MD_RELOAD_SB, &mddev->flags);
+
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-	md_wakeup_thread(mddev->thread);
+	wait_event(mddev->thread->wqueue,
+		   (got_lock = mddev_trylock(mddev)) ||
+		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
+	md_reload_sb(mddev, mddev->good_device_nr);
+	if (got_lock)
+		mddev_unlock(mddev);
 }
 
 static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
@@ -646,11 +653,29 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo)
+static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 {
-	int error;
+	int error, set_bit = 0;
+	struct mddev *mddev = cinfo->mddev;
 
+	/*
+	 * If resync thread run after raid1d thread, then process_metadata_update
+	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
+	 * since another node already got EX on Token and waitting the EX of Ack),
+	 * so let resync wake up thread in case flag is set.
+	 */
+	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				      &cinfo->state)) {
+		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+					      &cinfo->state);
+		WARN_ON_ONCE(error);
+		md_wakeup_thread(mddev->thread);
+		set_bit = 1;
+	}
 	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+
 	if (error)
 		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
 				__func__, __LINE__, error);
@@ -663,12 +688,12 @@ static int lock_token(struct md_cluster_info *cinfo)
 /* lock_comm()
  * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
  */
-static int lock_comm(struct md_cluster_info *cinfo)
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
 
-	return lock_token(cinfo);
+	return lock_token(cinfo, mddev_locked);
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -743,11 +768,12 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	return error;
 }
 
-static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
+static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
+		   bool mddev_locked)
 {
 	int ret;
 
-	lock_comm(cinfo);
+	lock_comm(cinfo, mddev_locked);
 	ret = __sendmsg(cinfo, cmsg);
 	unlock_comm(cinfo);
 	return ret;
@@ -834,6 +860,7 @@ static int join(struct mddev *mddev, int nodes)
 	mutex_init(&cinfo->recv_mutex);
 
 	mddev->cluster_info = cinfo;
+	cinfo->mddev = mddev;
 
 	memset(str, 0, 64);
 	sprintf(str, "%pU", mddev->uuid);
@@ -908,6 +935,7 @@ static int join(struct mddev *mddev, int nodes)
 
 	return 0;
 err:
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
@@ -943,7 +971,7 @@ static void resync_bitmap(struct mddev *mddev)
 	int err;
 
 	cmsg.type = cpu_to_le32(BITMAP_NEEDS_SYNC);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		pr_err("%s:%d: failed to send BITMAP_NEEDS_SYNC message (%d)\n",
 			__func__, __LINE__, err);
@@ -963,6 +991,7 @@ static int leave(struct mddev *mddev)
 	if (cinfo->slot_number > 0 && mddev->recovery_cp != MaxSector)
 		resync_bitmap(mddev);
 
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
@@ -997,16 +1026,30 @@ static int slot_number(struct mddev *mddev)
 static int metadata_update_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
+	int ret;
+
+	/*
+	 * metadata_update_start is always called with the protection of
+	 * reconfig_mutex, so set WAITING_FOR_TOKEN here.
+	 */
+	ret = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				    &cinfo->state);
+	WARN_ON_ONCE(ret);
+	md_wakeup_thread(mddev->thread);
 
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state) ||
 		   test_and_clear_bit(MD_CLUSTER_SEND_LOCKED_ALREADY, &cinfo->state));
 
 	/* If token is already locked, return 0 */
-	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
+	if (cinfo->token_lockres->mode == DLM_LOCK_EX) {
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 		return 0;
+	}
 
-	return lock_token(cinfo);
+	ret = lock_token(cinfo, 1);
+	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return ret;
 }
 
 static int metadata_update_finish(struct mddev *mddev)
@@ -1069,7 +1112,14 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 	cmsg.low = cpu_to_le64(lo);
 	cmsg.high = cpu_to_le64(hi);
 
-	return sendmsg(cinfo, &cmsg);
+	/*
+	 * mddev_lock is held if resync_info_update is called from
+	 * resync_finish (md_reap_sync_thread -> resync_finish)
+	 */
+	if (lo == 0 && hi == 0)
+		return sendmsg(cinfo, &cmsg, 1);
+	else
+		return sendmsg(cinfo, &cmsg, 0);
 }
 
 static int resync_finish(struct mddev *mddev)
@@ -1119,7 +1169,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo);
+	lock_comm(cinfo, 1);
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret)
 		return ret;
@@ -1179,7 +1229,7 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	cmsg.type = cpu_to_le32(REMOVE);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	return sendmsg(cinfo, &cmsg);
+	return sendmsg(cinfo, &cmsg, 1);
 }
 
 static int lock_all_bitmaps(struct mddev *mddev)
@@ -1243,7 +1293,7 @@ static int gather_bitmaps(struct md_rdev *rdev)
 
 	cmsg.type = cpu_to_le32(RE_ADD);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f25bd6de0c72..44206bc6e3aa 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8384,7 +8384,6 @@ void md_check_recovery(struct mddev *mddev)
 		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -8433,9 +8432,6 @@ void md_check_recovery(struct mddev *mddev)
 						rdev->raid_disk < 0)
 					md_kick_rdev_from_array(rdev);
 			}
-
-			if (test_and_clear_bit(MD_RELOAD_SB, &mddev->flags))
-				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
 		if (!mddev->external) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1c00160b09f9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -219,9 +219,6 @@ enum mddev_flags {
 				 * it then */
 	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
 	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
-	MD_RELOAD_SB,		/* Reload the superblock because another node
-				 * updated it.
-				 */
 	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 				   * already took resync lock, need to
 				   * release the lock */
-- 
2.6.2


^ permalink raw reply related

* [PATCH] mdadm:check the nodes when operate clustered array
From: Zhilong Liu @ 2017-03-01 10:42 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

It doesn't make sense to write_bitmap with less than 2 nodes,
in order to avoid 'write_bitmap' received invalid nodes number,
it would be better to do checking nodes in getopt operations.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
such as following method would reproduce the issue.
# mdadm -CR /dev/md0 -l1 -n2 --bitmap=internal /dev/sda /dev/sdb
# mdadm --grow /dev/md0 --bitmap=none
# mdadm --grow /dev/md0 --bitmap=clustered --nodes=1
# mdadm -X /dev/sda, it would be written internal bitmap.

diff --git a/mdadm.c b/mdadm.c
index b5d89e4..747101c 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -603,8 +603,8 @@ int main(int argc, char *argv[])
 		case O(GROW, Nodes):
 		case O(CREATE, Nodes):
 			c.nodes = parse_num(optarg);
-			if (c.nodes <= 0) {
-				pr_err("invalid number for the number of cluster nodes: %s\n",
+			if (c.nodes < 2) {
+				pr_err("clustered array needs two nodes at least: %s\n",
 					optarg);
 				exit(2);
 			}
diff --git a/super1.c b/super1.c
index 87a74cb..882cd61 100644
--- a/super1.c
+++ b/super1.c
@@ -2380,14 +2380,7 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
 		}
 
 		if (bms->version == BITMAP_MAJOR_CLUSTERED) {
-			if (st->nodes == 1) {
-				/* the parameter for nodes is not valid */
-				pr_err("Warning: cluster-md at least needs two nodes\n");
-				return -EINVAL;
-			} else if (st->nodes == 0)
-				/* --nodes is not specified */
-				break;
-			else if (__cpu_to_le32(st->nodes) < bms->nodes) {
+			if (__cpu_to_le32(st->nodes) < bms->nodes) {
 				/* Since the nodes num is not increased, no need to check the space
 				 * is enough or not, just update bms->nodes */
 				bms->nodes = __cpu_to_le32(st->nodes);
-- 
2.6.6


^ permalink raw reply related

* Re: [PATCH] mdadm: add checking the clustered bitmap in assemble mode
From: Coly Li @ 2017-03-01 12:03 UTC (permalink / raw)
  To: Zhilong Liu, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1488354469-19137-1-git-send-email-zlliu@suse.com>

On 2017/3/1 下午3:47, Zhilong Liu wrote:
> 1. both clustered and internal array don't need to specify
> --bitmap when assembling array.
> 2. clustered array doesn't support external bitmap mode.
> 
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> 
> diff --git a/mdadm.c b/mdadm.c
> index b5d89e4..b08cace 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1095,8 +1095,14 @@ int main(int argc, char *argv[])
>  				pr_err("bitmap file needed with -b in --assemble mode\n");
>  				exit(2);
>  			}
> -			if (strcmp(optarg, "internal") == 0) {
> -				pr_err("there is no need to specify --bitmap when assembling arrays with internal bitmaps\n");
> +			if (strcmp(optarg, "internal") == 0 ||
> +			    strcmp(optarg, "clustered") == 0) {
> +				pr_err("no need to specify --bitmap when assembling arrays with internal or clustered bitmaps\n");

Is it important for 80 characters width limitation ?

> +				continue;

Here if optarg is "clustered", it will continue and won't check the
following strcmp.

> +			}
> +			if (strcmp(optarg, "clustered") == 0 &&
> +			    strchr(optarg, '/') != NULL) {

I guess this check won't happen at all. Correct me if I am wrong.

> +				pr_err("clustered array doesn't support external bitmap\n");
>  				continue;
>  			}
>  			bitmap_fd = open(optarg, O_RDWR);
> 

Thanks.

Coly

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Gilad Ben-Yossef @ 2017-03-01 12:42 UTC (permalink / raw)
  To: Milan Broz
  Cc: Binoy Jayan, Rajendra, Herbert Xu, Oded, Mike Snitzer,
	Linux kernel mailing list, Ondrej Mosnacek, linux-raid, dm-devel,
	Mark Brown, Arnd Bergmann, linux-crypto, Shaohua Li,
	David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <b563eb97-82ba-69d2-c4c5-66bc716a7507@gmail.com>

On Wed, Mar 1, 2017 at 11:29 AM, Milan Broz <gmazyland@gmail.com> wrote:
>
> On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> > On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz <gmazyland@gmail.com> wrote:
> >>
> >> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >>>
> >>> I was wondering if this is near to be ready for submission (apart from
> >>> the testmgr.c
> >>> changes) or I need to make some changes to make it similar to the IPSec offload?
> >>
> >> I just tried this and except it registers the IV for every new device again, it works...
> >> (After a while you have many duplicate entries in /proc/crypto.)
> >>
> >> But I would like to see some summary why such a big patch is needed in the first place.
> >> (During an internal discussions seems that people are already lost in mails and
> >> patches here, so Ondra promised me to send some summary mail soon here.)
> >>
> >> IIRC the first initial problem was dmcrypt performance on some embedded
> >> crypto processors that are not able to cope with small crypto requests effectively.
> >>
> >>
> >> Do you have some real performance numbers that proves that such a patch is adequate?
> >>
> >> I would really like to see the performance issue fixed but I am really not sure
> >> this approach works for everyone. It would be better to avoid repeating this exercise later.
> >> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> >> to speedup things even for crypt drivers that do not support own IV generators.
> >>
> >
> > AFAIK the problem that we are trying to solve is that if the IV is
> > generated outside the crypto API
> > domain than you are forced to have an invocation of the crypto API per
> > each block because you
> > need to provide the IV for each block.
> >
> > By putting the IV generation responsibility in the Crypto API we open
> > the way to do a single invocation
> > of the crypto API for a whole sequence of blocks.
>
> Sure, but this is theory. Does it really work on some hw already?
> Do you have performance measurements or comparison?

I'm working on up streaming a driver for Arm  CryptoCell that supports
this and working
offline to get Binoy a board to test this with. Alas, shipping crypto
HW has its fair share
of regulatory challenges... :-)

I can certainly understand if you don't wont to take the patch until
we have results with
dm-crypt itself but the difference between 8 separate invocation of
the engine for 512
bytes of XTS and a single invocation for 4KB are pretty big.

From what I know of HW engines I'd be surprised if this is in any way
unique to CryptoCell.

> > For software implementation of XTS this doesn't matter much - but for
> > hardware based XTS providers
>
> It is not only embedded crypto, we have some more reports in the past
> that 512B sectors are not ideal even for other systems.
> (IIRC it was also with AES-NI that represents really big group of users).

I never said anything about embedded :-)

It really is an observation about overhead of context switches between
dm-crypt and
whatever/wherever you handle crypto - be it an off CPU hardware engine
or a bunch
of parallel kernel threads running on other cores. You really want to
burst as much as
possible.


>
> > This lead some vendors to ship hacked up versions of dm-crypt to match
> > the specific crypto hardware
> > they were using, or so I've heard at least - didn't see the code myself.
>
> I saw few version of that. There was a very hacky way to provide request-based dmcrypt
> (see old "Introduce the request handling for dm-crypt" thread on dm-devel).
> This is not the acceptable way but definitely it points to the same problem.
>
> > I believe Binoy is trying to address this in a generic upstream worthy
> > way instead.
>
> IIRC the problem is performance, if we can solve it by some generic way,
> good, but for now it seems to be a big change and just hope it helps later...
>

I see what you're saying. We need number to back this up.

> > Anyway, you are only supposed to see s difference when using a
> > hardware based XTS provider algo
> > that supports IV generation.
> >
> >> I like the patch is now contained inside dmcrypt, but it still exposes IVs that
> >> are designed just for old, insecure, compatibility-only containers.
> >>
> >> I really do not think every compatible crap must be accessible through crypto API.
> >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that this way
> >> if I know it is accessible outside of dmcrypt internals...)
> >> Even the ESSIV is something that was born to fix predictive IVs (CBC watermarking
> >> attacks) for disk encryption only, no reason to expose it outside of disk encryption.
> >>
> >
> > The point is that you have more than one implementation of these
> > "compatible crap" - the
> > software implementation that you wrote and potentially multiple
> > hardware implementations
> > and putting this in the crypto API domain is the way to abstract this
> > so you use the one
> > that works best of your platform.
>
> For XTS you need just simple linear IV. No problem with that, implementation
> in crypto API and hw is trivial.
>
> But for compatible IV (that provides compatibility with loopAES and very old TrueCrypt),
> these should be never ever implemented again anywhere.

>
> Specifically "tcw" is broken, insecure and provided here just to help people to migrate
> from old Truecrypt containers. Even Truecrypt followers removed it from the codebase.
> (It is basically combination of IV and slight modification of CBC mode. All
> recent version switched to XTS and plain IV.)
>
> So building abstraction over something known to be broken and that is now intentionally
> isolated inside dmcrypt is, in my opinion, really not a good idea.
>

I don't think anyone is interested in these modes. How do you support
XTS and essiv in
a generic way without supporting this broken modes is not something
I'm clear on though.

>
> But please do get me wrong,  I do not want to block any improvement.
>
> But it seems to me that this thread focused on creating nice crypto API interface
> for FDE IVs instead of demonstration that the proposed solution really solves
> the performance issue.
> And not only for your hw driver, maybe other systems could benefit from the better
> processing of small requests as well.
>

Of course, the benefits at large needs to outweigh the cost. But I
don't think functioning
better when working on large bursts is in any way special to specific HW.

Indeed, I wonder if we can show a benefit for just cryptd use case.
I'll look into that.

Thanks,
Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Milan Broz @ 2017-03-01 13:04 UTC (permalink / raw)
  To: Gilad Ben-Yossef, Milan Broz
  Cc: Binoy Jayan, Rajendra, Herbert Xu, Oded, Mike Snitzer,
	Linux kernel mailing list, Ondrej Mosnacek, linux-raid, dm-devel,
	Mark Brown, Arnd Bergmann, linux-crypto, Shaohua Li,
	David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <CAOtvUMePYT7OJDKL7i0y3y6Jvqsxyx6Je6g9aGKVq6PLZ_-Z8w@mail.gmail.com>

On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
...

> I can certainly understand if you don't wont to take the patch until
> we have results with
> dm-crypt itself but the difference between 8 separate invocation of
> the engine for 512
> bytes of XTS and a single invocation for 4KB are pretty big.

Yes, I know it. But the same can be achieved if we just implement
4k sector encryption in dmcrypt. It is incompatible with LUKS1
(but next LUKS version will support it) but I think this is not
a problem for now.

If the underlying device supports atomic write of 4k sectors, then
there should not be a problem.

This is one of the speed-up I would like to compare with the IV approach,
because everyone should benefit from 4k sectors in the end.
And no crypto API changes are needed here.

(I have an old patch for this, so I will try to revive it.)

Milan

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: ian_bruce @ 2017-03-01 13:05 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87mvd6oy8n.fsf@notabene.neil.brown.name>

On Wed, 01 Mar 2017 07:29:28 +1100
NeilBrown <neilb@suse.com> wrote:

> Thanks.  That makes it easy.
> Test works with my patch applied.

Thanks for fixing that.

If anybody is curious, the application for this capability is as
follows. For live systems running from a USB flashdrive, we need to
loop-mount an ext4 filesystem image from the fat32-formatted flashdrive.
Unfortunately, the maximum file size on fat32 is 4GB, which is a severe
limitation, when 128GB flashdrives are commonly available.

The solution is to split the ext4 image into multiple sub-4GB chunks,
associate a /dev/loop device with each of those files, have mdadm turn
those into a single RAID device, and mount that as the ext4 filesystem.
It is preferable to use non-metadata, linear-mode RAID for this, because
we can then convert back and forth between the single filesystem image
and its constituent chunks using the non-privileged utilities "cat" and
"split". With a maximum of 27 RAID component devices, the maximum
filesystem size would be 108GB, which is not quite a complete solution.

On Fri, 24 Feb 2017 15:46:19 -0500
Phil Turmel <philip@turmel.org> wrote:

> Note that build mode doesn't support a bunch of other MD raid features
> either, like all of the parity raid levels. That it doesn't support
> v1+ metadata isn't a surprise, and isn't the only legacy feature that
> only uses legacy metadata (built-in kernel auto-assembly gets the most
> whining, actually).

> If you think its trivial to implement --build with v1.x metadata, go
> right ahead. Post your patches for review.

I haven't tested "mdadm --build" with parity RAID myself (although the
/dev/loop trick would probably suffice for that too), but if this is so,
would the change to provide that be as simple as the patch to remove the
27-component limitation? (Although I suppose that unlike linear mode,
the component devices for parity mode would have to be initialized with
consistent data, first.)

Somebody might find a use for non-metadata, parity-mode RAID, if it were
available.

-- Ian Bruce

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Ondrej Mosnacek @ 2017-03-01 13:21 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Milan Broz, Binoy Jayan, Oded, Ofir, Herbert Xu, David S. Miller,
	linux-crypto, Mark Brown, Arnd Bergmann,
	Linux kernel mailing list, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <CAOtvUMePYT7OJDKL7i0y3y6Jvqsxyx6Je6g9aGKVq6PLZ_-Z8w@mail.gmail.com>

2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef <gilad@benyossef.com>:
> It really is an observation about overhead of context switches between
> dm-crypt and
> whatever/wherever you handle crypto - be it an off CPU hardware engine
> or a bunch
> of parallel kernel threads running on other cores. You really want to
> burst as much as
> possible.

[...]

>> For XTS you need just simple linear IV. No problem with that, implementation
>> in crypto API and hw is trivial.
>>
>> But for compatible IV (that provides compatibility with loopAES and very old TrueCrypt),
>> these should be never ever implemented again anywhere.
>
>>
>> Specifically "tcw" is broken, insecure and provided here just to help people to migrate
>> from old Truecrypt containers. Even Truecrypt followers removed it from the codebase.
>> (It is basically combination of IV and slight modification of CBC mode. All
>> recent version switched to XTS and plain IV.)
>>
>> So building abstraction over something known to be broken and that is now intentionally
>> isolated inside dmcrypt is, in my opinion, really not a good idea.
>>
>
> I don't think anyone is interested in these modes. How do you support
> XTS and essiv in
> a generic way without supporting this broken modes is not something
> I'm clear on though.

Wouldn't adopting a bulk request API (something like what I tried to
do here [1]) that allows users to supply multiple messages, each with
their own IV, fulfill this purpose? That way, we wouldn't need to
introduce any new modes into Crypto API and the drivers/accelerators
would only need to provide bulk implementations of common modes
(xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
(and possibly other users, too).

I'm not sure how exactly these crypto accelerators work, but wouldn't
it help if the drivers simply get more messages (in our case sectors)
in a single call? I wonder, would (efficiently) supporting such a
scheme require changes in the HW itself or could it be achieved just
by modifying driver code (let's say specifically for your CryptoCell
accelerator)?

Thanks,
Ondrej

[1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html

>
> Thanks,
> Gilad
>
>
>
> --
> Gilad Ben-Yossef
> Chief Coffee Drinker
>
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Wols Lists @ 2017-03-01 15:02 UTC (permalink / raw)
  To: Phil Turmel, ian_bruce, linux-raid
In-Reply-To: <657e80e9-b1f5-1f58-a4d0-6cbc4cc44927@turmel.org>

On 26/02/17 00:07, Phil Turmel wrote:
>> Because to do so doesn't make sense? Or because nobody's bothered to do
>> > it? I get grumpy when people implement corner cases without bothering to
>> > implement the logically sensible options - bit like those extremely
>> > annoying dialog boxes that give you three choices, "yes", "no", "yes to
>> > all". What about no to all?

> Because while disconnected, and the array begins accumulating
> write-intent bits indicating where any disconnected device is out of
> date, the array has no way to know what writes are happening to that
> member.  And therefore any re-add will introduce unknowable corruptions.
>  There is no way to control what writes happen to that member, and
> drives don't naturally keep a log of writes that have happened.  The data to
> safely do what you want simply doesn't exist.  Your only known safe
> choice is to disable write-intent bitmaps, forcing complete resync on
> --re-add.

Sorry to drag this up again, but where are these write intent bits going
to come from? And it's a backup. Why am I going to re-add, unless I'm
going to wipe the old backup and create a new one?
> 
>> > I feel like mirror-raid is perfect for doing backups.

> Your feelings are wrong.  Sorry.  LVM is the perfect tool because it
> entirely controls the snapshot and doesn't have to re-add it.
> 
I think we're talking at cross-purposes here :-) You're talking about
creating a snapshot and backing it up. I'm talking about creating a
mirror, which IS the backup.

VERY different technique, same end result.

And your way is more complicated - more room for sys-admin cock-up :-)

>> > I take your point
>> > that linux hasn't implemented that feature (particularly well), but
>> > surely it's a feature that *should* be there. I know I know - "patches
>> > welcome" :-)

> Good luck creating the necessary data from thin air.  It's not a
> question of writing patches.
> 
mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
... hotplug sd-big ...
madam /dev/mdbackup --add /dev/sd-big
... wait for sync to finish ...
mdadm --stop mdbackup
... unplug sd-big ...

You've made me think about it deeper than before - thanks - and I can
think of at least one potential show-stopper, but write-intent bitmaps
and missing raid data are most definitely not it :-)

And why do I think my way is "better" (for certain values of "better"
:-) - because your way only works if it was planned in advance. My way -
if the show stopper isn't - will work on ANY running system whether
planned or not. That said, my problem probably is a show stopper :-(

Cheers,
Wol

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Milan Broz @ 2017-03-01 15:38 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Binoy Jayan, Oded, Ofir, Herbert Xu, David S. Miller,
	linux-crypto, Mark Brown, Arnd Bergmann,
	Linux kernel mailing list, Alasdair Kergon, Mike Snitzer,
	dm-devel, Shaohua Li, linux-raid, Rajendra, Ondrej Mosnacek
In-Reply-To: <2aef6e54-805f-e09b-ae66-c198f8c05335@gmail.com>


On 03/01/2017 02:04 PM, Milan Broz wrote:
> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
> ...
> 
>> I can certainly understand if you don't wont to take the patch until
>> we have results with
>> dm-crypt itself but the difference between 8 separate invocation of
>> the engine for 512
>> bytes of XTS and a single invocation for 4KB are pretty big.
> 
> Yes, I know it. But the same can be achieved if we just implement
> 4k sector encryption in dmcrypt. It is incompatible with LUKS1
> (but next LUKS version will support it) but I think this is not
> a problem for now.
> 
> If the underlying device supports atomic write of 4k sectors, then
> there should not be a problem.
> 
> This is one of the speed-up I would like to compare with the IV approach,
> because everyone should benefit from 4k sectors in the end.
> And no crypto API changes are needed here.
> 
> (I have an old patch for this, so I will try to revive it.)

If anyone interested, simple experimental patch for larger sector size
(up to the page size) for dmcrypt is in this branch:

http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector

It would be nice to check what performance gain could be provided
by this simple approach.

Milan

^ permalink raw reply

* Re: [PATCH v4 0/7] Partial Parity Log for MD RAID 5
From: Artur Paszkiewicz @ 2017-03-01 16:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228013504.5zlzaaxsvboow6hj@kernel.org>

On 02/28/2017 02:35 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:54PM +0100, Artur Paszkiewicz wrote:
>> This series of patches implements the Partial Parity Log for RAID5 arrays. The
>> purpose of this feature is closing the RAID 5 Write Hole. It is a solution
>> alternative to the existing raid5-cache, but the logging workflow and much of
>> the implementation is based on it.
>>
>> The main differences compared to raid5-cache is that PPL is a distributed log -
>> it is stored on array member drives in the metadata area and does not require a
>> dedicated journaling drive. Write performance is reduced by up to 30%-40% but
>> it scales with the number of drives in the array and the journaling drive does
>> not become a bottleneck or a single point of failure. PPL does not protect from
>> losing in-flight data, only from silent data corruption. More details about how
>> the log works can be found in patches 3 and 5.
>>
>> This feature originated from Intel RSTe, which uses IMSM metadata. This
>> patchset implements PPL for external metadata (specifically IMSM) as well as
>> native MD v1.x metadata.
>>
>> Changes in mdadm are also required to make this fully usable. Patches for mdadm
>> will be sent later.
> 
> Generally looks ok, I replied separatly. I'd like to make sure the new format
> will be supported by Intel RSTe, and intel will continue working on this.

PPL for IMSM is going to be included in RSTe implementations starting
with the upcoming Xeon platforms and Intel will continue supporting and
maintaining it.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 2/7] raid5: calculate partial parity for a stripe
From: Artur Paszkiewicz @ 2017-03-01 16:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170227234514.3bo5jxf35ie7vhor@kernel.org>

On 02/28/2017 12:45 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:56PM +0100, Artur Paszkiewicz wrote:
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.h |   3 ++
>>  2 files changed, 103 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 7b7722bb2e8d..02e02fe5b04e 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -463,6 +463,11 @@ static void shrink_buffers(struct stripe_head *sh)
>>  		sh->dev[i].page = NULL;
>>  		put_page(p);
>>  	}
>> +
>> +	if (sh->ppl_page) {
>> +		put_page(sh->ppl_page);
>> +		sh->ppl_page = NULL;
>> +	}
>>  }
>>  
>>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>> @@ -479,6 +484,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>>  		sh->dev[i].page = page;
>>  		sh->dev[i].orig_page = page;
>>  	}
>> +
>> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
>> +		sh->ppl_page = alloc_page(gfp);
>> +		if (!sh->ppl_page)
>> +			return 1;
>> +	}
>> +
>>  	return 0;
>>  }
>>  
>> @@ -1974,6 +1986,55 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
>>  			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
>>  }
>>  
>> +static struct dma_async_tx_descriptor *
>> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>> +		       struct dma_async_tx_descriptor *tx)
>> +{
>> +	int disks = sh->disks;
>> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
>> +	int count = 0, pd_idx = sh->pd_idx, i;
>> +	struct async_submit_ctl submit;
>> +
>> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
>> +
>> +	/*
>> +	 * Partial parity is the XOR of stripe data chunks that are not changed
>> +	 * during the write request. Depending on available data
>> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
>> +	 * differently.
>> +	 */
>> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
>> +		/* rmw: xor old data and parity from updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
>> +		/* rcw: xor data from all not updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_UPTODATE, &dev->flags))
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else {
>> +		return tx;
>> +	}
>> +
>> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
>> +			  flex_array_get(percpu->scribble, 0)
>> +			  + sizeof(struct page *) * (sh->disks + 2));
>> +
>> +	if (count == 1)
>> +		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
>> +				  &submit);
>> +	else
>> +		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
>> +			       &submit);
>> +
>> +	return tx;
>> +}
> 
> Can you put this function to raid5-ppl.c? I'd like to keep all the log related
> out raid5.c if possible.

Sure, I'll move it if you prefer that, but I thought that it's good to
have all the ops_run_ functions in one place.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-03-01 16:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228000419.rljabgaurh7e3rqq@kernel.org>

On 02/28/2017 01:04 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:57PM +0100, Artur Paszkiewicz wrote:
>> This implements the PPL write logging functionality. The description
>> of PPL is added to the documentation. More details can be found in the
>> comments in raid5-ppl.c.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Warn about using PPL with enabled disk volatile write-back cache for
>> now. It can be removed once disk cache flushing before writing PPL is
>> implemented.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/md/raid5-ppl.txt |  44 +++
>>  drivers/md/Makefile            |   2 +-
>>  drivers/md/raid5-ppl.c         | 617 +++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c             |  49 +++-
>>  drivers/md/raid5.h             |   8 +
>>  include/uapi/linux/raid/md_p.h |  26 ++
>>  6 files changed, 738 insertions(+), 8 deletions(-)
>>  create mode 100644 Documentation/md/raid5-ppl.txt
>>  create mode 100644 drivers/md/raid5-ppl.c
>>
>> diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
>> new file mode 100644
>> index 000000000000..127072b09363
>> --- /dev/null
>> +++ b/Documentation/md/raid5-ppl.txt
>> @@ -0,0 +1,44 @@
>> +Partial Parity Log
>> +
>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>> +may become inconsistent with data on other member disks. If the array is also
>> +in degraded state, there is no way to recalculate parity, because one of the
>> +disks is missing. This can lead to silent data corruption when rebuilding the
>> +array or using it is as degraded - data calculated from parity for array blocks
>> +that have not been touched by a write request during the unclean shutdown can
>> +be incorrect. Such condition is known as the RAID5 Write Hole. Because of
>> +this, md by default does not allow starting a dirty degraded array.
>> +
>> +Partial parity for a write operation is the XOR of stripe data chunks not
>> +modified by this write. It is just enough data needed for recovering from the
>> +write hole. XORing partial parity with the modified chunks produces parity for
>> +the stripe, consistent with its state before the write operation, regardless of
>> +which chunk writes have completed. If one of the not modified data disks of
>> +this stripe is missing, this updated parity can be used to recover its
>> +contents. PPL recovery is also performed when starting an array after an
>> +unclean shutdown and all disks are available, eliminating the need to resync
>> +the array. Because of this, using write-intent bitmap and PPL together is not
>> +supported.
>> +
>> +When handling a write request PPL writes partial parity before new data and
>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>> +array member drives in the metadata area, on the parity drive of a particular
>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>> +and the journaling drive does not become a bottleneck or a single point of
>> +failure.
>> +
>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>> +not a true journal. It does not protect from losing in-flight data, only from
>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>> +performed for this stripe (parity is not updated). So it is possible to have
>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>> +case the behavior is the same as in plain raid5.
>> +
>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>> +metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
>> +
>> +Currently, volatile write-back cache should be disabled on all member drives
>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>> +failure.
>> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
>> index 3cbda1af87a0..4d48714ccc6b 100644
>> --- a/drivers/md/Makefile
>> +++ b/drivers/md/Makefile
>> @@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
>>  dm-era-y	+= dm-era-target.o
>>  dm-verity-y	+= dm-verity-target.o
>>  md-mod-y	+= md.o bitmap.o
>> -raid456-y	+= raid5.o raid5-cache.o
>> +raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
>>  
>>  # Note: link order is important.  All raid personalities
>>  # and must come before md.o, as they each initialise 
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> new file mode 100644
>> index 000000000000..a00cabf1adf6
>> --- /dev/null
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -0,0 +1,617 @@
>> +/*
>> + * Partial Parity Log for closing the RAID5 write hole
>> + * Copyright (c) 2017, Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>> +#include "raid5.h"
>> +
>> +/*
>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>> + * partial parity data. The header contains an array of entries
>> + * (struct ppl_header_entry) which describe the logged write requests.
>> + * Partial parity for the entries comes after the header, written in the same
>> + * sequence as the entries:
>> + *
>> + * Header
>> + *   entry0
>> + *   ...
>> + *   entryN
>> + * PP data
>> + *   PP for entry0
>> + *   ...
>> + *   PP for entryN
>> + *
>> + * Every entry holds a checksum of its partial parity, the header also has a
>> + * checksum of the header itself. Entries for full stripes writes contain no
>> + * partial parity, they only mark the stripes for which parity should be
>> + * recalculated after an unclean shutdown.
> 
> Please describe the disk format in details. I had trouble to understand
> ppl_log_stripe() and ppl_submit_iounit(), things like the ppl merge, full
> stripes handling.

OK, I will improve it, but more information about this is also in patch 5.
 
>> + *
>> + * A write request is always logged to the PPL instance stored on the parity
>> + * disk of the corresponding stripe. For each member disk there is one ppl_log
>> + * used to handle logging for this disk, independently from others. They are
>> + * grouped in child_logs array in struct ppl_conf, which is assigned to
>> + * r5conf->ppl and used in raid5 core.
>> + *
>> + * ppl_io_unit represents a full PPL write, header_page contains the ppl_header.
>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>> + * be appended to the last entry if the chunks to write are the same, otherwise
>> + * a new entry is added. Checksums of entries are calculated incrementally as
>> + * stripes containing partial parity are being added to entries.
>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>> + * containing the header page and partial parity pages (sh->ppl_page) for all
>> + * stripes of the io_unit. When the PPL write completes, the stripes associated
>> + * with the io_unit are released and raid5d starts writing their data and
>> + * parity. When all stripes are written, the io_unit is freed and the next can
>> + * be submitted.
>> + *
>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>> + * can't be submitted until the previous has completed (PPL and stripe
>> + * data+parity is written). The log->io_list tracks all io_units of a log
>> + * (for a single member disk). New io_units are added to the end of the list
>> + * and the first io_unit is submitted, if it is not submitted already.
>> + * The current io_unit accepting new stripes is always at the end of the list.
>> + */
>> +
>> +struct ppl_conf {
>> +	struct mddev *mddev;
>> +
>> +	/* array of child logs, one for each raid disk */
>> +	struct ppl_log *child_logs;
>> +	int count;
>> +
>> +	int block_size;		/* the logical block size used for data_sector
>> +				 * in ppl_header_entry */
>> +	u32 signature;		/* raid array identifier */
>> +	atomic64_t seq;		/* current log write sequence number */
>> +
>> +	struct kmem_cache *io_kc;
>> +	mempool_t *io_pool;
>> +	struct bio_set *bs;
>> +	mempool_t *meta_pool;
>> +};
>> +
>> +struct ppl_log {
>> +	struct ppl_conf *ppl_conf;	/* shared between all log instances */
>> +
>> +	struct md_rdev *rdev;		/* array member disk associated with
>> +					 * this log instance */
>> +	struct mutex io_mutex;
>> +	struct ppl_io_unit *current_io;	/* current io_unit accepting new data
>> +					 * always at the end of io_list */
>> +	spinlock_t io_list_lock;
>> +	struct list_head io_list;	/* all io_units of this log */
>> +	struct list_head no_mem_stripes;/* stripes to retry if failed to
>> +					 * allocate io_unit */
>> +};
>> +
>> +struct ppl_io_unit {
>> +	struct ppl_log *log;
>> +
>> +	struct page *header_page;	/* for ppl_header */
>> +
>> +	unsigned int entries_count;	/* number of entries in ppl_header */
>> +	unsigned int pp_size;		/* total size current of partial parity */
>> +
>> +	u64 seq;			/* sequence number of this log write */
>> +	struct list_head log_sibling;	/* log->io_list */
>> +
>> +	struct list_head stripe_list;	/* stripes added to the io_unit */
>> +	atomic_t pending_stripes;	/* how many stripes not written to raid */
>> +
>> +	bool submitted;			/* true if write to log started */
>> +};
>> +
>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>> +					  struct stripe_head *sh)
>> +{
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct ppl_io_unit *io;
>> +	struct ppl_header *pplhdr;
>> +
>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>> +	if (!io)
>> +		return NULL;
>> +
>> +	memset(io, 0, sizeof(*io));
>> +	io->log = log;
>> +	INIT_LIST_HEAD(&io->log_sibling);
>> +	INIT_LIST_HEAD(&io->stripe_list);
>> +	atomic_set(&io->pending_stripes, 0);
>> +
>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>> +	pplhdr = page_address(io->header_page);
>> +	clear_page(pplhdr);
>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>> +	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
>> +
>> +	io->seq = atomic64_add_return(1, &ppl_conf->seq);
>> +	pplhdr->generation = cpu_to_le64(io->seq);
>> +
>> +	return io;
>> +}
>> +
>> +static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
>> +{
>> +	struct ppl_io_unit *io = log->current_io;
>> +	struct ppl_header_entry *e = NULL;
>> +	struct ppl_header *pplhdr;
>> +	int i;
>> +	sector_t data_sector = 0;
>> +	int data_disks = 0;
>> +	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
>> +
>> +	/* check if current io_unit is full */
>> +	if (io && (io->pp_size == entry_space ||
>> +		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
>> +		pr_debug("%s: add io_unit blocked by seq: %llu\n",
>> +			 __func__, io->seq);
>> +		io = NULL;
>> +	}
>> +
>> +	/* add a new unit if there is none or the current is full */
>> +	if (!io) {
>> +		io = ppl_new_iounit(log, sh);
>> +		if (!io)
>> +			return -ENOMEM;
>> +		spin_lock_irq(&log->io_list_lock);
>> +		list_add_tail(&io->log_sibling, &log->io_list);
>> +		spin_unlock_irq(&log->io_list_lock);
>> +
>> +		log->current_io = io;
>> +	}
>> +
>> +	for (i = 0; i < sh->disks; i++) {
>> +		struct r5dev *dev = &sh->dev[i];
>> +
>> +		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
>> +			if (!data_disks || dev->sector < data_sector)
>> +				data_sector = dev->sector;
>> +			data_disks++;
>> +		}
>> +	}
>> +	BUG_ON(!data_disks);
>> +
>> +	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
>> +		 io->seq, (unsigned long long)data_sector, data_disks);
>> +
>> +	pplhdr = page_address(io->header_page);
>> +
>> +	if (io->entries_count > 0) {
>> +		struct ppl_header_entry *last =
>> +				&pplhdr->entries[io->entries_count - 1];
>> +		u64 data_sector_last = le64_to_cpu(last->data_sector);
>> +		u32 data_size_last = le32_to_cpu(last->data_size);
>> +		u32 pp_size_last = le32_to_cpu(last->pp_size);
>> +
>> +		/*
>> +		 * Check if we can merge with the last entry. Must be on
>> +		 * the same stripe and disks. Use bit shift and logarithm
>> +		 * to avoid 64-bit division.
>> +		 */
>> +		if ((data_sector >> ilog2(conf->chunk_sectors) ==
>> +		     data_sector_last >> ilog2(conf->chunk_sectors)) &&
>> +		    ((pp_size_last == 0 &&
>> +		      test_bit(STRIPE_FULL_WRITE, &sh->state)) ||
>> +		     ((data_sector_last + (pp_size_last >> 9) == data_sector) &&
>> +		      (data_size_last == pp_size_last * data_disks))))
> 
> please explain this part

This checks if the last and currently logged stripe_head are on the same
stripe and the disks to write are the same. The conditions are probably
more complicated than they should be, I'll try to simplify it.
 
>> +			e = last;
>> +	}
>> +
>> +	if (!e) {
>> +		e = &pplhdr->entries[io->entries_count++];
>> +		e->data_sector = cpu_to_le64(data_sector);
> 
> Don't really understand what's the data_sector for. Should this be stripe->sector?

This is the first raid sector of the ppl entry. It can be equal to
stripe->sector if the write starts at the first disk of the stripe. So
generally it is the sector of the first device in the stripe with
R5_Wantwrite of the first stripe in the ppl entry.
 
>> +		e->parity_disk = cpu_to_le32(sh->pd_idx);
>> +		e->checksum = cpu_to_le32(~0);
>> +	}
>> +
>> +	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
>> +
>> +	/* don't write any PP if full stripe write */
>> +	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
>> +		le32_add_cpu(&e->pp_size, PAGE_SIZE);
>> +		io->pp_size += PAGE_SIZE;
>> +		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
>> +						    page_address(sh->ppl_page),
>> +						    PAGE_SIZE));
>> +	}
>> +
>> +	list_add_tail(&sh->log_list, &io->stripe_list);
>> +	atomic_inc(&io->pending_stripes);
>> +	sh->ppl_log_io = io;
>> +
>> +	return 0;
>> +}
>> +
>> +int ppl_write_stripe(struct ppl_conf *ppl_conf, struct stripe_head *sh)
>> +{
>> +	struct ppl_log *log;
>> +	struct ppl_io_unit *io = sh->ppl_log_io;
>> +
>> +	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
>> +	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
>> +	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
>> +		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	log = &ppl_conf->child_logs[sh->pd_idx];
>> +
>> +	mutex_lock(&log->io_mutex);
>> +
>> +	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
> 
> is the log->rdev check required?

Yes, log->rdev can be null if the member drive for this stripe's parity
is missing. Patch 6 ("support disk hot add/remove with PPL") makes it
more apparent.
 
>> +		mutex_unlock(&log->io_mutex);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
>> +	clear_bit(STRIPE_DELAYED, &sh->state);
>> +	atomic_inc(&sh->count);
>> +
>> +	if (ppl_log_stripe(log, sh)) {
>> +		spin_lock_irq(&log->io_list_lock);
>> +		list_add_tail(&sh->log_list, &log->no_mem_stripes);
>> +		spin_unlock_irq(&log->io_list_lock);
>> +	}
>> +
>> +	mutex_unlock(&log->io_mutex);
>> +
>> +	return 0;
>> +}
>> +
>> +static void ppl_log_endio(struct bio *bio)
>> +{
>> +	struct ppl_io_unit *io = bio->bi_private;
>> +	struct ppl_log *log = io->log;
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct stripe_head *sh, *next;
>> +
>> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
>> +
>> +	if (bio->bi_error)
>> +		md_error(ppl_conf->mddev, log->rdev);
>> +
>> +	bio_put(bio);
>> +	mempool_free(io->header_page, ppl_conf->meta_pool);
>> +
>> +	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>> +		list_del_init(&sh->log_list);
>> +
>> +		set_bit(STRIPE_HANDLE, &sh->state);
>> +		raid5_release_stripe(sh);
>> +	}
>> +}
>> +
>> +static void ppl_submit_iounit(struct ppl_io_unit *io)
>> +{
>> +	struct ppl_log *log = io->log;
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct r5conf *conf = ppl_conf->mddev->private;
>> +	struct ppl_header *pplhdr = page_address(io->header_page);
>> +	struct bio *bio;
>> +	struct stripe_head *sh;
>> +	int i;
>> +	struct bio_list bios = BIO_EMPTY_LIST;
>> +	char b[BDEVNAME_SIZE];
>> +
>> +	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
>> +	bio->bi_private = io;
>> +	bio->bi_end_io = ppl_log_endio;
>> +	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
>> +	bio->bi_bdev = log->rdev->bdev;
>> +	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
>> +	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
>> +	bio_list_add(&bios, bio);
>> +
>> +	sh = list_first_entry(&io->stripe_list, struct stripe_head, log_list);
>> +
>> +	for (i = 0; i < io->entries_count; i++) {
>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>> +		u32 data_size = le32_to_cpu(e->data_size);
>> +		u64 data_sector = le64_to_cpu(e->data_sector);
>> +		int stripes_count;
>> +
>> +		if (pp_size > 0)
>> +			stripes_count = pp_size >> PAGE_SHIFT;
>> +		else
>> +			stripes_count = (data_size /
>> +					 (conf->raid_disks -
>> +					  conf->max_degraded)) >> PAGE_SHIFT;
> 
> please explain this part

The number of stripes of a ppl entry is equal to the size of its partial
parity divided by 4k, except when the stripes are STRIPE_FULL_WRITE -
then there is no partial parity, but the number of stripes is then equal
to the entry's data size / array data disks / 4k.

>> +
>> +		while (stripes_count--) {
>> +			/*
>> +			 * if entry without partial parity just skip its stripes
>> +			 * without adding pages to bio
>> +			 */
>> +			if (pp_size > 0 &&
>> +			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
>> +				struct bio *prev = bio;
>> +
>> +				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
>> +						       ppl_conf->bs);
> 
> The code keeps allocating bios but doesn't dispatch any yet. The bioset can't
> guarantee the allocation successes. To have the guarantee, we must make sure
> previous bios could finish.

Is it ok to submit the bio before allocating another from the same
bioset, without waiting for it to complete? The description for
bio_alloc_bioset() suggests this, if I understand it correctly: "Callers
that need to allocate more than 1 bio must always submit the previously
allocated bio for IO before attempting to allocate a new one." 
 
>> +				bio->bi_opf = prev->bi_opf;
>> +				bio->bi_bdev = prev->bi_bdev;
>> +				bio->bi_iter.bi_sector = bio_end_sector(prev);
>> +				bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
>> +				bio_chain(bio, prev);
>> +				bio_list_add(&bios, bio);
>> +			}
>> +			sh = list_next_entry(sh, log_list);
>> +		}
>> +
>> +		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
>> +			 __func__, io->seq, i, data_sector, pp_size, data_size);
>> +
>> +		e->data_sector = cpu_to_le64(data_sector >>
>> +					     ilog2(ppl_conf->block_size >> 9));
>> +		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
>> +	}
>> +
>> +	pplhdr->entries_count = cpu_to_le32(io->entries_count);
>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
> 
> so the checksum is ~crc32c, right? please doc in ppl_header_entry definition

Ok, I will add it to the descriptions for both header and entry checksums.

>> +
>> +	while ((bio = bio_list_pop(&bios))) {
>> +		pr_debug("%s: seq: %llu submit_bio() size: %u sector: %llu dev: %s\n",
>> +			 __func__, io->seq, bio->bi_iter.bi_size,
>> +			 (unsigned long long)bio->bi_iter.bi_sector,
>> +			 bdevname(bio->bi_bdev, b));
>> +		submit_bio(bio);
>> +	}
>> +}
>> +
>> +static void ppl_submit_current_io(struct ppl_log *log)
>> +{
>> +	struct ppl_io_unit *io;
>> +
>> +	spin_lock_irq(&log->io_list_lock);
>> +
>> +	io = list_first_entry_or_null(&log->io_list, struct ppl_io_unit,
>> +				      log_sibling);
>> +	if (io && io->submitted)
>> +		io = NULL;
>> +
>> +	spin_unlock_irq(&log->io_list_lock);
>> +
>> +	if (io) {
>> +		io->submitted = true;
>> +
>> +		if (io == log->current_io)
>> +			log->current_io = NULL;
>> +
>> +		ppl_submit_iounit(io);
>> +	}
>> +}
>> +
>> +void ppl_write_stripe_run(struct ppl_conf *ppl_conf)
>> +{
>> +	struct ppl_log *log;
>> +	int i;
>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		log = &ppl_conf->child_logs[i];
>> +
>> +		mutex_lock(&log->io_mutex);
>> +		ppl_submit_current_io(log);
>> +		mutex_unlock(&log->io_mutex);
>> +	}
>> +}
>> +
>> +static void ppl_io_unit_finished(struct ppl_io_unit *io)
>> +{
>> +	struct ppl_log *log = io->log;
>> +	unsigned long flags;
>> +
>> +	pr_debug("%s: seq: %llu\n", __func__, io->seq);
>> +
>> +	spin_lock_irqsave(&log->io_list_lock, flags);
>> +
>> +	list_del(&io->log_sibling);
>> +	mempool_free(io, log->ppl_conf->io_pool);
>> +
>> +	if (!list_empty(&log->no_mem_stripes)) {
>> +		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
>> +							  struct stripe_head,
>> +							  log_list);
>> +		list_del_init(&sh->log_list);
>> +		set_bit(STRIPE_HANDLE, &sh->state);
>> +		raid5_release_stripe(sh);
>> +	}
>> +
>> +	spin_unlock_irqrestore(&log->io_list_lock, flags);
>> +}
>> +
>> +void ppl_stripe_write_finished(struct stripe_head *sh)
>> +{
>> +	struct ppl_io_unit *io;
>> +
>> +	io = sh->ppl_log_io;
>> +	sh->ppl_log_io = NULL;
>> +
>> +	if (io && atomic_dec_and_test(&io->pending_stripes))
>> +		ppl_io_unit_finished(io);
>> +}
>> +
>> +void ppl_exit_log(struct ppl_conf *ppl_conf)
>> +{
>> +	kfree(ppl_conf->child_logs);
>> +
>> +	mempool_destroy(ppl_conf->meta_pool);
>> +	if (ppl_conf->bs)
>> +		bioset_free(ppl_conf->bs);
>> +	mempool_destroy(ppl_conf->io_pool);
>> +	kmem_cache_destroy(ppl_conf->io_kc);
>> +
>> +	kfree(ppl_conf);
>> +}
>> +
>> +static int ppl_validate_rdev(struct md_rdev *rdev)
>> +{
>> +	char b[BDEVNAME_SIZE];
>> +	int ppl_data_sectors;
>> +	int ppl_size_new;
>> +
>> +	/*
>> +	 * The configured PPL size must be enough to store
>> +	 * the header and (at the very least) partial parity
>> +	 * for one stripe. Round it down to ensure the data
>> +	 * space is cleanly divisible by stripe size.
>> +	 */
>> +	ppl_data_sectors = rdev->ppl.size - (PPL_HEADER_SIZE >> 9);
>> +
>> +	if (ppl_data_sectors > 0)
>> +		ppl_data_sectors = rounddown(ppl_data_sectors, STRIPE_SECTORS);
>> +
>> +	if (ppl_data_sectors <= 0) {
>> +		pr_warn("md/raid:%s: PPL space too small on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -ENOSPC;
>> +	}
>> +
>> +	ppl_size_new = ppl_data_sectors + (PPL_HEADER_SIZE >> 9);
>> +
>> +	if ((rdev->ppl.sector < rdev->data_offset &&
>> +	     rdev->ppl.sector + ppl_size_new > rdev->data_offset) ||
>> +	    (rdev->ppl.sector >= rdev->data_offset &&
>> +	     rdev->data_offset + rdev->sectors > rdev->ppl.sector)) {
>> +		pr_warn("md/raid:%s: PPL space overlaps with data on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!rdev->mddev->external &&
>> +	    ((rdev->ppl.offset > 0 && rdev->ppl.offset < (rdev->sb_size >> 9)) ||
>> +	     (rdev->ppl.offset <= 0 && rdev->ppl.offset + ppl_size_new > 0))) {
>> +		pr_warn("md/raid:%s: PPL space overlaps with superblock on %s\n",
>> +			mdname(rdev->mddev), bdevname(rdev->bdev, b));
>> +		return -EINVAL;
>> +	}
>> +
>> +	rdev->ppl.size = ppl_size_new;
>> +
>> +	return 0;
>> +}
>> +
>> +int ppl_init_log(struct r5conf *conf)
>> +{
>> +	struct ppl_conf *ppl_conf;
>> +	struct mddev *mddev = conf->mddev;
>> +	int ret = 0;
>> +	int i;
>> +	bool need_cache_flush;
>> +
>> +	if (PAGE_SIZE != 4096)
>> +		return -EINVAL;
>> +
>> +	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
>> +		pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
>> +			mdname(mddev));
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +		pr_warn("md/raid:%s PPL is not compatible with journal\n",
>> +			mdname(mddev));
>> +		return -EINVAL;
>> +	}
> 
> shouldn't these two checks be put into md.c, which we load the rdev?

Then they would not be performed for external metadata and when enabling
ppl at runtime. I think we need this in a central place where ppl is
started.

>> +
>> +	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
>> +	if (!ppl_conf)
>> +		return -ENOMEM;
>> +
>> +	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
>> +	if (!ppl_conf->io_kc) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
>> +	if (!ppl_conf->io_pool) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
>> +	if (!ppl_conf->bs) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
>> +	if (!ppl_conf->meta_pool) {
>> +		ret = -EINVAL;
>> +		goto err;
>> +	}
>> +
>> +	ppl_conf->mddev = mddev;
>> +	ppl_conf->count = conf->raid_disks;
>> +	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct ppl_log),
>> +				       GFP_KERNEL);
>> +	if (!ppl_conf->child_logs) {
>> +		ret = -ENOMEM;
>> +		goto err;
>> +	}
>> +
>> +	atomic64_set(&ppl_conf->seq, 0);
>> +
>> +	if (!mddev->external) {
>> +		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
>> +		ppl_conf->block_size = 512;
>> +	} else {
>> +		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
>> +	}
> 
> Can we always set block_size 512? Don't really understand this part.

For native md arrays we can use 512 byte sectors, like in the rest of md
metadata, but for imsm the values in ppl_header_entry.data_sector must
be in logical sectors, due to compatibility with UEFI and Windows.

>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		struct ppl_log *log = &ppl_conf->child_logs[i];
>> +		struct md_rdev *rdev = conf->disks[i].rdev;
>> +
>> +		mutex_init(&log->io_mutex);
>> +		spin_lock_init(&log->io_list_lock);
>> +		INIT_LIST_HEAD(&log->io_list);
>> +		INIT_LIST_HEAD(&log->no_mem_stripes);
>> +
>> +		log->ppl_conf = ppl_conf;
>> +		log->rdev = rdev;
>> +
>> +		if (rdev) {
>> +			struct request_queue *q;
>> +
>> +			ret = ppl_validate_rdev(rdev);
>> +			if (ret)
>> +				goto err;
>> +
>> +			q = bdev_get_queue(rdev->bdev);
>> +			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
>> +				need_cache_flush = true;
>> +		}
>> +	}
>> +
>> +	if (need_cache_flush)
>> +		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
>> +			mdname(mddev));
>> +
>> +	conf->ppl = ppl_conf;
>> +
>> +	return 0;
>> +err:
>> +	ppl_exit_log(ppl_conf);
>> +	return ret;
>> +}
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 02e02fe5b04e..21440b594878 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -739,7 +739,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
>>  {
>>  	struct r5conf *conf = sh->raid_conf;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
>>  		return false;
> 
> Maybe add a new inline function into raid5-log.h

OK, I will move it.
 
>>  	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
>>  		!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
>> @@ -936,6 +936,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>  		}
>>  	}
>>  
>> +	if (conf->ppl && ppl_write_stripe(conf->ppl, sh) == 0)
>> +		return;
> 
> please put the one and the above raid5-cache part into a separate function into raid5-log.h
> 
>> +
>>  	for (i = disks; i--; ) {
>>  		int op, op_flags = 0;
>>  		int replace_only = 0;
>> @@ -3308,6 +3311,16 @@ static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
>>  			     &dd_idx, sh);
>>  }
>>  
>> +static void log_stripe_write_finished(struct stripe_head *sh)
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	if (conf->log)
>> +		r5l_stripe_write_finished(sh);
>> +	else if (conf->ppl)
>> +		ppl_stripe_write_finished(sh);
>> +}
> 
> please put this into raid5-log.h
> 
>> +
>>  static void
>>  handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  				struct stripe_head_state *s, int disks,
>> @@ -3347,7 +3360,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  		if (bi)
>>  			bitmap_end = 1;
>>  
>> -		r5l_stripe_write_finished(sh);
>> +		log_stripe_write_finished(sh);
>>  
>>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
>>  			wake_up(&conf->wait_for_overlap);
>> @@ -3766,7 +3779,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>>  				discard_pending = 1;
>>  		}
>>  
>> -	r5l_stripe_write_finished(sh);
>> +	log_stripe_write_finished(sh);
>>  
>>  	if (!discard_pending &&
>>  	    test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
>> @@ -4756,7 +4769,7 @@ static void handle_stripe(struct stripe_head *sh)
>>  
>>  	if (s.just_cached)
>>  		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
>> -	r5l_stripe_write_finished(sh);
>> +	log_stripe_write_finished(sh);
>>  
>>  	/* Now we might consider reading some blocks, either to check/generate
>>  	 * parity, or to satisfy requests
>> @@ -6120,6 +6133,14 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
>>  	return handled;
>>  }
>>  
>> +static void log_write_stripe_run(struct r5conf *conf)
>> +{
>> +	if (conf->log)
>> +		r5l_write_stripe_run(conf->log);
>> +	else if (conf->ppl)
>> +		ppl_write_stripe_run(conf->ppl);
>> +}
> 
> ditto
> 
>> +
>>  static int handle_active_stripes(struct r5conf *conf, int group,
>>  				 struct r5worker *worker,
>>  				 struct list_head *temp_inactive_list)
>> @@ -6157,7 +6178,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
>>  
>>  	for (i = 0; i < batch_size; i++)
>>  		handle_stripe(batch[i]);
>> -	r5l_write_stripe_run(conf->log);
>> +	log_write_stripe_run(conf);
>>  
>>  	cond_resched();
>>  
>> @@ -6735,6 +6756,8 @@ static void free_conf(struct r5conf *conf)
>>  
>>  	if (conf->log)
>>  		r5l_exit_log(conf->log);
>> +	if (conf->ppl)
>> +		ppl_exit_log(conf->ppl);
> 
> Ditto
> 
>>  	if (conf->shrinker.nr_deferred)
>>  		unregister_shrinker(&conf->shrinker);
>>  
>> @@ -7196,6 +7219,13 @@ static int raid5_run(struct mddev *mddev)
>>  		BUG_ON(mddev->delta_disks != 0);
>>  	}
>>  
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
>> +	    test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
>> +			mdname(mddev));
>> +		clear_bit(MD_HAS_PPL, &mddev->flags);
>> +	}
>> +
>>  	if (mddev->private == NULL)
>>  		conf = setup_conf(mddev);
>>  	else
>> @@ -7422,6 +7452,11 @@ static int raid5_run(struct mddev *mddev)
>>  			 mdname(mddev), bdevname(journal_dev->bdev, b));
>>  		if (r5l_init_log(conf, journal_dev))
>>  			goto abort;
>> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
>> +			 mdname(mddev));
>> +		if (ppl_init_log(conf))
>> +			goto abort;
>>  	}
> 
> Ditto, if possible
> 
>>  
>>  	return 0;
>> @@ -7690,7 +7725,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
>>  	sector_t newsize;
>>  	struct r5conf *conf = mddev->private;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
>>  		return -EINVAL;
>>  	sectors &= ~((sector_t)conf->chunk_sectors - 1);
>>  	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
>> @@ -7743,7 +7778,7 @@ static int check_reshape(struct mddev *mddev)
>>  {
>>  	struct r5conf *conf = mddev->private;
>>  
>> -	if (conf->log)
>> +	if (conf->log || conf->ppl)
> 
> there are other places we check conf->log. like stripe_can_batch(), I think the
> ppl code can't hanel batch right now.

Yes, I added this also in stripe_can_batch(). I checked all other places
where conf->log is used and I think that everything is covered in this
patchset, but I will check again.

>>  		return -EINVAL;
>>  	if (mddev->delta_disks == 0 &&
>>  	    mddev->new_layout == mddev->layout &&
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index 3cc4cb28f7e6..f915a7a0e752 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -230,6 +230,7 @@ struct stripe_head {
>>  	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
>>  
>>  	struct page		*ppl_page; /* partial parity of this stripe */
>> +	struct ppl_io_unit	*ppl_log_io;
> 
> Maybe we should put the ppl fields and raid5 fields to a union

Good idea, will do.

>>  	/**
>>  	 * struct stripe_operations
>>  	 * @target - STRIPE_OP_COMPUTE_BLK target
>> @@ -689,6 +690,7 @@ struct r5conf {
>>  	int			group_cnt;
>>  	int			worker_cnt_per_group;
>>  	struct r5l_log		*log;
>> +	struct ppl_conf		*ppl;
> 
> Maybe use void * log_private, so works for both raid5-cache and ppl

Do you mean replace "struct ppl_conf *ppl" with "void *log_private"?
 
>>  
>>  	struct bio_list		pending_bios;
>>  	spinlock_t		pending_bios_lock;
>> @@ -798,4 +800,10 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
>>  extern struct md_sysfs_entry r5c_journal_mode;
>>  extern void r5c_update_on_rdev_error(struct mddev *mddev);
>>  extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
>> +
>> +extern int ppl_init_log(struct r5conf *conf);
>> +extern void ppl_exit_log(struct ppl_conf *log);
>> +extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
>> +extern void ppl_write_stripe_run(struct ppl_conf *log);
>> +extern void ppl_stripe_write_finished(struct stripe_head *sh);
> 
> These part and the raid5 part should be put into raid5-log.h
> 
>>  #endif
>> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
>> index fe2112810c43..2c28711cc5f1 100644
>> --- a/include/uapi/linux/raid/md_p.h
>> +++ b/include/uapi/linux/raid/md_p.h
>> @@ -398,4 +398,30 @@ struct r5l_meta_block {
>>  
>>  #define R5LOG_VERSION 0x1
>>  #define R5LOG_MAGIC 0x6433c509
>> +
>> +struct ppl_header_entry {
>> +	__le64 data_sector;	/* Raid sector of the new data */
>> +	__le32 pp_size;		/* Length of partial parity */
>> +	__le32 data_size;	/* Length of data */
>> +	__le32 parity_disk;	/* Member disk containing parity */
>> +	__le32 checksum;	/* Checksum of this entry */
> 
> So we changed the disk format. Will this new format be put into IMSM standard?
> Do we have a version to verify the old/new format for IMSM?

Yes, this will be included in IMSM. The platforms with support for this
are not yet released, so minor changes to the format were still
possible. This can be considered the first version of the PPL metadata
format. In the future if there will be any changes I think that for
example a version field could be inserted in place of
ppl_header.padding, which is currently always set to 0.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 4/7] md: add sysfs entries for PPL
From: Artur Paszkiewicz @ 2017-03-01 16:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228003703.iyhta33quiic6ayk@kernel.org>

On 02/28/2017 01:37 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:58PM +0100, Artur Paszkiewicz wrote:
>> Add 'consistency_policy' attribute for array. It indicates how the array
>> maintains consistency in case of unexpected shutdown.
>>
>> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
>> and size of the PPL space on the device. They can't be changed for
>> active members if the array is started and PPL is enabled, so in the
>> setter functions only basic checks are performed. More checks are done
>> in ppl_validate_rdev() when starting the log.
>>
>> These attributes are writable to allow enabling PPL for external
>> metadata arrays and (later) to enable/disable PPL for a running array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 144 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index 1e61bf50595c..84de718f24a4 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -276,14 +276,14 @@ All md devices contain:
>>       array creation it will default to 0, though starting the array as
>>       ``clean`` will set it much larger.
>>  
>> -   new_dev
>> +  new_dev
>>       This file can be written but not read.  The value written should
>>       be a block device number as major:minor.  e.g. 8:0
>>       This will cause that device to be attached to the array, if it is
>>       available.  It will then appear at md/dev-XXX (depending on the
>>       name of the device) and further configuration is then possible.
>>  
>> -   safe_mode_delay
>> +  safe_mode_delay
>>       When an md array has seen no write requests for a certain period
>>       of time, it will be marked as ``clean``.  When another write
>>       request arrives, the array is marked as ``dirty`` before the write
>> @@ -292,7 +292,7 @@ All md devices contain:
>>       period as a number of seconds.  The default is 200msec (0.200).
>>       Writing a value of 0 disables safemode.
>>  
>> -   array_state
>> +  array_state
>>       This file contains a single word which describes the current
>>       state of the array.  In many cases, the state can be set by
>>       writing the word for the desired state, however some states
>> @@ -401,7 +401,30 @@ All md devices contain:
>>       once the array becomes non-degraded, and this fact has been
>>       recorded in the metadata.
>>  
>> +  consistency_policy
>> +     This indicates how the array maintains consistency in case of unexpected
>> +     shutdown. It can be:
>>  
>> +     none
>> +       Array has no redundancy information, e.g. raid0, linear.
>> +
>> +     resync
>> +       Full resync is performed and all redundancy is regenerated when the
>> +       array is started after unclean shutdown.
>> +
>> +     bitmap
>> +       Resync assisted by a write-intent bitmap.
>> +
>> +     journal
>> +       For raid4/5/6, journal device is used to log transactions and replay
>> +       after unclean shutdown.
>> +
>> +     ppl
>> +       For raid5 only, Partial Parity Log is used to close the write hole and
>> +       eliminate resync.
>> +
>> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
>> +     used to enable and disable PPL.
>>  
>>  
>>  As component devices are added to an md array, they appear in the ``md``
>> @@ -563,6 +586,9 @@ Each directory contains:
>>  	adds bad blocks without acknowledging them. This is largely
>>  	for testing.
>>  
>> +      ppl_sector, ppl_size
>> +        Location and size (in sectors) of the space used for Partial Parity Log
>> +        on this device.
>>  
>>  
>>  An active md device will also contain an entry for each active device
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index c2028007b209..3ff979d538d4 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -3157,6 +3157,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>>  
>> +static ssize_t
>> +ppl_sector_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
>> +}
>> +
>> +static ssize_t
>> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned long long sector;
>> +
>> +	if (kstrtoull(buf, 10, &sector) < 0)
>> +		return -EINVAL;
> 
> Maybe use base 0 to be more friendly

OK, good idea.

>> +	if (sector != (sector_t)sector)
>> +		return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if ((sector > rdev->sb_start &&
>> +		     sector - rdev->sb_start > S16_MAX) ||
>> +		    (sector < rdev->sb_start &&
>> +		     rdev->sb_start - sector > -S16_MIN))
>> +			return -EINVAL;
> 
> Don't check if the address overlaps with data?

It is checked in ppl_validate_rdev() when enabling ppl or adding a disk
to an array with ppl.

>> +		rdev->ppl.offset = sector - rdev->sb_start;
>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
> 
> generally we don't use {} for single line code

I thought this conforms to the coding-style doc? It says: "This does not
apply if only one branch of a conditional statement is a single
statement; in the latter case use braces in both branches"

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 5/7] raid5-ppl: load and recover the log
From: Artur Paszkiewicz @ 2017-03-01 16:14 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228011211.iveohkxlokujmy5x@kernel.org>

On 02/28/2017 02:12 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
>> Load the log from each disk when starting the array and recover if the
>> array is dirty.
>>
>> The initial empty PPL is written by mdadm. When loading the log we
>> verify the header checksum and signature. For external metadata arrays
>> the signature is verified in userspace, so here we read it from the
>> header, verifying only if it matches on all disks, and use it later when
>> writing PPL.
>>
>> In addition to the header checksum, each header entry also contains a
>> checksum of its partial parity data. If the header is valid, recovery is
>> performed for each entry until an invalid entry is found. If the array
>> is not degraded and recovery using PPL fully succeeds, there is no need
>> to resync the array because data and parity will be consistent, so in
>> this case resync will be disabled.
>>
>> Due to compatibility with IMSM implementations on other systems, we
>> can't assume that the recovery data block size is always 4K. Writes
>> generated by MD raid5 don't have this issue, but when recovering PPL
>> written in other environments it is possible to have entries with
>> 512-byte sector granularity. The recovery code takes this into account
>> and also the logical sector size of the underlying drives.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c     |   5 +-
>>  2 files changed, 487 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index a00cabf1adf6..a7693353243a 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/blkdev.h>
>>  #include <linux/slab.h>
>>  #include <linux/crc32c.h>
>> +#include <linux/async_tx.h>
>>  #include <linux/raid/md_p.h>
>>  #include "md.h"
>>  #include "raid5.h"
>> @@ -84,6 +85,10 @@ struct ppl_conf {
>>  	mempool_t *io_pool;
>>  	struct bio_set *bs;
>>  	mempool_t *meta_pool;
>> +
>> +	/* used only for recovery */
>> +	int recovered_entries;
>> +	int mismatch_count;
>>  };
>>  
>>  struct ppl_log {
>> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>  		ppl_io_unit_finished(io);
>>  }
>>  
>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>> +		    struct page *page_result)
>> +{
>> +	struct async_submit_ctl submit;
>> +	struct dma_async_tx_descriptor *tx;
>> +	struct page *xor_srcs[] = { page1, page2 };
>> +
>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>> +			  NULL, NULL, NULL, NULL);
>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>> +
>> +	async_tx_quiesce(&tx);
>> +}
>> +
>> +/*
>> + * PPL recovery strategy: xor partial parity and data from all modified data
>> + * disks within a stripe and write the result as the new stripe parity. If all
>> + * stripe data disks are modified (full stripe write), no partial parity is
>> + * available, so just xor the data disks.
>> + *
>> + * Recovery of a PPL entry shall occur only if all modified data disks are
>> + * available and read from all of them succeeds.
>> + *
>> + * A PPL entry applies to a stripe, partial parity size for an entry is at most
>> + * the size of the chunk. Examples of possible cases for a single entry:
>> + *
>> + * case 0: single data disk write:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+           +--------------------+
>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>> + * +--------+--------+--------+           +--------------------+
>> + * pp_size = data_size
>> + *
>> + * case 1: more than one data disk write:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+           +--------------------+
>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
>> + * | ------ | ------ | ------ | +----+    | (no change)        |
>> + * +--------+--------+--------+           +--------------------+
>> + * pp_size = data_size / modified_data_disks
>> + *
>> + * case 2: write to all data disks (also full stripe write):
>> + *   data0    data1    data2                parity
>> + * +--------+--------+--------+           +--------------------+
>> + * | ------ | ------ | ------ |           | (no change)        |
>> + * | -data- | -data- | -data- | --------> | xor all data       |
>> + * | ------ | ------ | ------ | --------> | (no change)        |
>> + * | ------ | ------ | ------ |           | (no change)        |
>> + * +--------+--------+--------+           +--------------------+
>> + * pp_size = 0
>> + *
>> + * The following cases are possible only in other implementations. The recovery
>> + * code can handle them, but they are not generated at runtime because they can
>> + * be reduced to cases 0, 1 and 2:
>> + *
>> + * case 3:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
>> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
>> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
>> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * pp_size = chunk_size
>> + *
>> + * case 4:
>> + *   data0    data1    data2     ppl        parity
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
>> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
>> + * +--------+--------+--------+ +----+    +--------------------+
>> + * pp_size = chunk_size
>> + */
>> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>> +			     sector_t ppl_sector)
>> +{
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct mddev *mddev = ppl_conf->mddev;
>> +	struct r5conf *conf = mddev->private;
>> +	int block_size = ppl_conf->block_size;
>> +	struct page *pages;
>> +	struct page *page1;
>> +	struct page *page2;
>> +	sector_t r_sector_first;
>> +	sector_t r_sector_last;
>> +	int strip_sectors;
>> +	int data_disks;
>> +	int i;
>> +	int ret = 0;
>> +	char b[BDEVNAME_SIZE];
>> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
>> +	unsigned int data_size = le32_to_cpu(e->data_size);
>> +
>> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
>> +
>> +	if ((pp_size >> 9) < conf->chunk_sectors) {
>> +		if (pp_size > 0) {
>> +			data_disks = data_size / pp_size;
>> +			strip_sectors = pp_size >> 9;
>> +		} else {
>> +			data_disks = conf->raid_disks - conf->max_degraded;
>> +			strip_sectors = (data_size >> 9) / data_disks;
>> +		}
>> +		r_sector_last = r_sector_first +
>> +				(data_disks - 1) * conf->chunk_sectors +
>> +				strip_sectors;
>> +	} else {
>> +		data_disks = conf->raid_disks - conf->max_degraded;
>> +		strip_sectors = conf->chunk_sectors;
>> +		r_sector_last = r_sector_first + (data_size >> 9);
>> +	}
>> +
>> +	pages = alloc_pages(GFP_KERNEL, 1);
> 
> order != 0 allocation should be avoid if possible. Here sounds not necessary

OK, I will allocate and free them separately.

>> +	if (!pages)
>> +		return -ENOMEM;
>> +	page1 = pages;
>> +	page2 = pages + 1;
>> +
>> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
>> +		 (unsigned long long)r_sector_first,
>> +		 (unsigned long long)r_sector_last);
>> +
>> +	/* if start and end is 4k aligned, use a 4k block */
>> +	if (block_size == 512 &&
>> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
>> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
>> +		block_size = STRIPE_SIZE;
>> +
>> +	/* iterate through blocks in strip */
>> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
>> +		bool update_parity = false;
>> +		sector_t parity_sector;
>> +		struct md_rdev *parity_rdev;
>> +		struct stripe_head sh;
>> +		int disk;
>> +		int indent = 0;
>> +
>> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
>> +		indent += 2;
>> +
>> +		memset(page_address(page1), 0, PAGE_SIZE);
>> +
>> +		/* iterate through data member disks */
>> +		for (disk = 0; disk < data_disks; disk++) {
>> +			int dd_idx;
>> +			struct md_rdev *rdev;
>> +			sector_t sector;
>> +			sector_t r_sector = r_sector_first + i +
>> +					    (disk * conf->chunk_sectors);
>> +
>> +			pr_debug("%s:%*s data member disk %d start\n",
>> +				 __func__, indent, "", disk);
>> +			indent += 2;
>> +
>> +			if (r_sector >= r_sector_last) {
>> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
>> +					 __func__, indent, "",
>> +					 (unsigned long long)r_sector);
>> +				indent -= 2;
>> +				continue;
>> +			}
>> +
>> +			update_parity = true;
>> +
>> +			/* map raid sector to member disk */
>> +			sector = raid5_compute_sector(conf, r_sector, 0,
>> +						      &dd_idx, NULL);
>> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
>> +				 __func__, indent, "",
>> +				 (unsigned long long)r_sector, dd_idx,
>> +				 (unsigned long long)sector);
>> +
>> +			rdev = conf->disks[dd_idx].rdev;
>> +			if (!rdev) {
>> +				pr_debug("%s:%*s data member disk %d missing\n",
>> +					 __func__, indent, "", dd_idx);
>> +				update_parity = false;
>> +				break;
>> +			}
>> +
>> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
>> +				 __func__, indent, "", bdevname(rdev->bdev, b),
>> +				 (unsigned long long)sector);
>> +			if (!sync_page_io(rdev, sector, block_size, page2,
>> +					REQ_OP_READ, 0, false)) {
>> +				md_error(mddev, rdev);
>> +				pr_debug("%s:%*s read failed!\n", __func__,
>> +					 indent, "");
>> +				ret = -EIO;
>> +				goto out;
>> +			}
>> +
>> +			ppl_xor(block_size, page1, page2, page1);
>> +
>> +			indent -= 2;
>> +		}
>> +
>> +		if (!update_parity)
>> +			continue;
>> +
>> +		if (pp_size > 0) {
>> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
>> +				 __func__, indent, "",
>> +				 (unsigned long long)(ppl_sector + i));
>> +			if (!sync_page_io(log->rdev,
>> +					ppl_sector - log->rdev->data_offset + i,
>> +					block_size, page2, REQ_OP_READ, 0,
>> +					false)) {
>> +				pr_debug("%s:%*s read failed!\n", __func__,
>> +					 indent, "");
>> +				md_error(mddev, log->rdev);
>> +				ret = -EIO;
>> +				goto out;
>> +			}
>> +
>> +			ppl_xor(block_size, page1, page2, page1);
>> +		}
>> +
>> +		/* map raid sector to parity disk */
>> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
>> +				0, &disk, &sh);
>> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
>> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
>> +
>> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
>> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
>> +			 __func__, indent, "",
>> +			 (unsigned long long)parity_sector,
>> +			 bdevname(parity_rdev->bdev, b));
>> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
>> +				page1, REQ_OP_WRITE, 0, false)) {
>> +			pr_debug("%s:%*s parity write error!\n", __func__,
>> +				 indent, "");
>> +			md_error(mddev, parity_rdev);
>> +			ret = -EIO;
>> +			goto out;
>> +		}
>> +	}
>> +out:
>> +	__free_pages(pages, 1);
>> +	return ret;
>> +}
>> +
>> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>> +{
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct md_rdev *rdev = log->rdev;
>> +	struct mddev *mddev = rdev->mddev;
>> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
>> +	struct page *page;
>> +	int i;
>> +	int ret = 0;
>> +
>> +	page = alloc_page(GFP_KERNEL);
>> +	if (!page)
>> +		return -ENOMEM;
>> +
>> +	/* iterate through all PPL entries saved */
>> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
>> +		struct ppl_header_entry *e = &pplhdr->entries[i];
>> +		u32 pp_size = le32_to_cpu(e->pp_size);
>> +		sector_t sector = ppl_sector;
>> +		int ppl_entry_sectors = pp_size >> 9;
>> +		u32 crc, crc_stored;
>> +
>> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
>> +			 __func__, rdev->raid_disk, i,
>> +			 (unsigned long long)ppl_sector, pp_size);
>> +
>> +		crc = ~0;
>> +		crc_stored = le32_to_cpu(e->checksum);
>> +
>> +		/* read parial parity for this entry and calculate its checksum */
>> +		while (pp_size) {
>> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
>> +
>> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
>> +					s, page, REQ_OP_READ, 0, false)) {
>> +				md_error(mddev, rdev);
>> +				ret = -EIO;
>> +				goto out;
>> +			}
>> +
>> +			crc = crc32c_le(crc, page_address(page), s);
>> +
>> +			pp_size -= s;
>> +			sector += s >> 9;
>> +		}
>> +
>> +		crc = ~crc;
>> +
>> +		if (crc != crc_stored) {
>> +			/*
>> +			 * Don't recover this entry if the checksum does not
>> +			 * match, but keep going and try to recover other
>> +			 * entries.
>> +			 */
>> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
>> +				 __func__, crc_stored, crc);
>> +			ppl_conf->mismatch_count++;
>> +		} else {
>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>> +			if (ret)
>> +				goto out;
>> +			ppl_conf->recovered_entries++;
>> +		}
>> +
>> +		ppl_sector += ppl_entry_sectors;
>> +	}
>> +out:
>> +	__free_page(page);
>> +	return ret;
>> +}
>> +
>> +static int ppl_write_empty_header(struct ppl_log *log)
>> +{
>> +	struct page *page;
>> +	struct ppl_header *pplhdr;
>> +	struct md_rdev *rdev = log->rdev;
>> +	int ret = 0;
>> +
>> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
>> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
>> +
>> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
>> +	if (!page)
>> +		return -ENOMEM;
>> +
>> +	pplhdr = page_address(page);
>> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
>> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
>> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
>> +
>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
> 
> write with FUA? otherwise, the empty header might not set down in the disks.

OK.

>> +		md_error(rdev->mddev, rdev);
>> +		ret = -EIO;
>> +	}
>> +
>> +	__free_page(page);
>> +	return ret;
>> +}
>> +
>> +static int ppl_load_distributed(struct ppl_log *log)
>> +{
>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>> +	struct md_rdev *rdev = log->rdev;
>> +	struct mddev *mddev = rdev->mddev;
>> +	struct page *page;
>> +	struct ppl_header *pplhdr;
>> +	u32 crc, crc_stored;
>> +	u32 signature;
>> +	int ret = 0;
>> +
>> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
>> +
>> +	/* read PPL header */
>> +	page = alloc_page(GFP_KERNEL);
>> +	if (!page)
>> +		return -ENOMEM;
>> +
>> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
>> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
>> +		md_error(mddev, rdev);
>> +		ret = -EIO;
>> +		goto out;
>> +	}
>> +	pplhdr = page_address(page);
>> +
>> +	/* check header validity */
>> +	crc_stored = le32_to_cpu(pplhdr->checksum);
>> +	pplhdr->checksum = 0;
>> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
>> +
>> +	if (crc_stored != crc) {
>> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
>> +			 __func__, crc_stored, crc);
>> +		ppl_conf->mismatch_count++;
>> +		goto out;
>> +	}
>> +
>> +	signature = le32_to_cpu(pplhdr->signature);
>> +
>> +	if (mddev->external) {
>> +		/*
>> +		 * For external metadata the header signature is set and
>> +		 * validated in userspace.
>> +		 */
>> +		ppl_conf->signature = signature;
>> +	} else if (ppl_conf->signature != signature) {
>> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
>> +			 __func__, signature, ppl_conf->signature);
>> +		ppl_conf->mismatch_count++;
>> +		goto out;
>> +	}
>> +
>> +	/* attempt to recover from log if we are starting a dirty array */
>> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
>> +		ret = ppl_recover(log, pplhdr);
> 
> when could mddev->pers be null?

When the array is inactive - we are starting the array with ppl, not
enabling ppl at runtime.

> we need flush disk cache here. I know you assume disks don't enable cache yet,
> but it's simple to do a flush here.

You are right, I will add it.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 6/7] raid5-ppl: support disk hot add/remove with PPL
From: Artur Paszkiewicz @ 2017-03-01 16:14 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228012226.xo76v3cdgqfddx4k@kernel.org>

On 02/28/2017 02:22 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:44:00PM +0100, Artur Paszkiewicz wrote:
>> Add a function to modify the log by removing an rdev when a drive fails
>> or adding when a spare/replacement is activated as a raid member.
>>
>> Removing a disk just clears the child log rdev pointer. No new stripes
>> will be accepted for this child log in ppl_write_stripe() and running io
>> units will be processed without writing PPL to the device.
>>
>> Adding a disk sets the child log rdev pointer and writes an empty PPL
>> header.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5-ppl.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c     | 15 +++++++++++++++
>>  drivers/md/raid5.h     |  8 ++++++++
>>  3 files changed, 70 insertions(+)
>>
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index a7693353243a..c2070d124849 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -319,6 +319,12 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
>>  
>>  	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, ppl_conf->bs);
>>  	bio->bi_private = io;
>> +
>> +	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
>> +		ppl_log_endio(bio);
>> +		return;
>> +	}
>> +
>>  	bio->bi_end_io = ppl_log_endio;
>>  	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
>>  	bio->bi_bdev = log->rdev->bdev;
>> @@ -1098,3 +1104,44 @@ int ppl_init_log(struct r5conf *conf)
>>  	ppl_exit_log(ppl_conf);
>>  	return ret;
>>  }
>> +
>> +int ppl_modify_log(struct ppl_conf *ppl_conf, struct md_rdev *rdev,
>> +		   enum ppl_modify_log_operation operation)
>> +{
>> +	struct ppl_log *log;
>> +	int ret = 0;
>> +	char b[BDEVNAME_SIZE];
>> +
>> +	if (!rdev)
>> +		return -EINVAL;
>> +
>> +	pr_debug("%s: disk: %d operation: %s dev: %s\n",
>> +		 __func__, rdev->raid_disk,
>> +		 operation == PPL_MODIFY_LOG_DISK_REMOVE ? "remove" :
>> +		 (operation == PPL_MODIFY_LOG_DISK_ADD ? "add" : "?"),
>> +		 bdevname(rdev->bdev, b));
>> +
>> +	if (rdev->raid_disk < 0)
>> +		return 0;
>> +
>> +	if (rdev->raid_disk >= ppl_conf->count)
>> +		return -ENODEV;
>> +
>> +	log = &ppl_conf->child_logs[rdev->raid_disk];
>> +
>> +	mutex_lock(&log->io_mutex);
>> +	if (operation == PPL_MODIFY_LOG_DISK_REMOVE) {
>> +		log->rdev = NULL;
>> +	} else if (operation == PPL_MODIFY_LOG_DISK_ADD) {
>> +		ret = ppl_validate_rdev(rdev);
>> +		if (!ret) {
>> +			log->rdev = rdev;
>> +			ret = ppl_write_empty_header(log);
>> +		}
>> +	} else {
>> +		ret = -EINVAL;
>> +	}
>> +	mutex_unlock(&log->io_mutex);
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 8b52392457d8..b17e90f06f19 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -7623,6 +7623,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>>  			*rdevp = rdev;
>>  		}
>>  	}
>> +	if (conf->ppl) {
>> +		err = ppl_modify_log(conf->ppl, rdev,
>> +				     PPL_MODIFY_LOG_DISK_REMOVE);
>> +		if (err)
>> +			goto abort;
>> +	}
>>  	if (p->replacement) {
>>  		/* We must have just cleared 'rdev' */
>>  		p->rdev = p->replacement;
>> @@ -7632,6 +7638,10 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>>  			   */
>>  		p->replacement = NULL;
>>  		clear_bit(WantReplacement, &rdev->flags);
>> +
>> +		if (conf->ppl)
>> +			err = ppl_modify_log(conf->ppl, p->rdev,
>> +					     PPL_MODIFY_LOG_DISK_ADD);
>>  	} else
>>  		/* We might have just removed the Replacement as faulty-
>>  		 * clear the bit just in case
>> @@ -7695,6 +7705,11 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
>>  			if (rdev->saved_raid_disk != disk)
>>  				conf->fullsync = 1;
>>  			rcu_assign_pointer(p->rdev, rdev);
>> +
>> +			if (conf->ppl)
>> +				err = ppl_modify_log(conf->ppl, rdev,
>> +						     PPL_MODIFY_LOG_DISK_ADD);
>> +
>>  			goto out;
>>  		}
>>  	}
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index f915a7a0e752..43cfa5fa71b3 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -806,4 +806,12 @@ extern void ppl_exit_log(struct ppl_conf *log);
>>  extern int ppl_write_stripe(struct ppl_conf *log, struct stripe_head *sh);
>>  extern void ppl_write_stripe_run(struct ppl_conf *log);
>>  extern void ppl_stripe_write_finished(struct stripe_head *sh);
>> +
>> +enum ppl_modify_log_operation {
>> +	PPL_MODIFY_LOG_DISK_REMOVE,
>> +	PPL_MODIFY_LOG_DISK_ADD,
>> +};
> 
> a bool for this parameter is better
> 
>> +extern int ppl_modify_log(struct ppl_conf *log, struct md_rdev *rdev,
>> +			  enum ppl_modify_log_operation operation);
> 
> So this should be put into raid5-log.h and rename to log_modify_log(conf, xxx)
> or other sensible name.

OK, I will do that.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH v4 7/7] raid5-ppl: runtime PPL enabling or disabling
From: Artur Paszkiewicz @ 2017-03-01 16:14 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid
In-Reply-To: <20170228013135.g4x7cu63rakvpacx@kernel.org>

On 02/28/2017 02:31 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 08:44:01PM +0100, Artur Paszkiewicz wrote:
>> Allow writing to 'consistency_policy' attribute when the array is
>> active. Add a new function 'change_consistency_policy' to the
>> md_personality operations structure to handle the change in the
>> personality code. Values "ppl" and "resync" are accepted and
>> turn PPL on and off respectively.
>>
>> When enabling PPL its location and size should first be set using
>> 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
>> written at this location on each member device.
>>
>> Enabling or disabling PPL is performed under a suspended array.  The
>> raid5_reset_stripe_cache function frees the stripe cache and allocates
>> it again in order to allocate or free the ppl_pages for the stripes in
>> the stripe cache.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/md.c        | 12 ++++++++---
>>  drivers/md/md.h        |  2 ++
>>  drivers/md/raid5-ppl.c |  4 ++++
>>  drivers/md/raid5.c     | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 73 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 3ff979d538d4..b70e19513588 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5003,14 +5003,20 @@ consistency_policy_show(struct mddev *mddev, char *page)
>>  static ssize_t
>>  consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
>>  {
>> +	int err = 0;
>> +
>>  	if (mddev->pers) {
>> -		return -EBUSY;
>> +		if (mddev->pers->change_consistency_policy)
>> +			err = mddev->pers->change_consistency_policy(mddev, buf);
>> +		else
>> +			err = -EBUSY;
>>  	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
>>  		set_bit(MD_HAS_PPL, &mddev->flags);
>> -		return len;
>>  	} else {
>> -		return -EINVAL;
>> +		err = -EINVAL;
>>  	}
>> +
>> +	return err ? err : len;
>>  }
>>  
>>  static struct md_sysfs_entry md_consistency_policy =
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 66a5a16e79f7..88a5155c152e 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -548,6 +548,8 @@ struct md_personality
>>  	/* congested implements bdi.congested_fn().
>>  	 * Will not be called while array is 'suspended' */
>>  	int (*congested)(struct mddev *mddev, int bits);
>> +	/* Changes the consistency policy of an active array. */
>> +	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
>>  };
>>  
>>  struct md_sysfs_entry {
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index c2070d124849..a5c4d3333bce 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -1095,6 +1095,10 @@ int ppl_init_log(struct r5conf *conf)
>>  		 */
>>  		mddev->recovery_cp = MaxSector;
>>  		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>> +	} else if (mddev->pers && ppl_conf->mismatch_count > 0) {
>> +		/* no mismatch allowed when enabling PPL for a running array */
>> +		ret = -EINVAL;
>> +		goto err;
>>  	}
>>  
>>  	conf->ppl = ppl_conf;
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index b17e90f06f19..754fb8e1c76f 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -8319,6 +8319,63 @@ static void *raid6_takeover(struct mddev *mddev)
>>  	return setup_conf(mddev);
>>  }
>>  
>> +static void raid5_reset_stripe_cache(struct mddev *mddev)
>> +{
>> +	struct r5conf *conf = mddev->private;
>> +
>> +	mutex_lock(&conf->cache_size_mutex);
>> +	while (conf->max_nr_stripes &&
>> +	       drop_one_stripe(conf))
>> +		;
>> +	while (conf->min_nr_stripes > conf->max_nr_stripes &&
>> +	       grow_one_stripe(conf, GFP_KERNEL))
>> +		;
>> +	mutex_unlock(&conf->cache_size_mutex);
>> +}
>> +
>> +static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
>> +{
>> +	struct r5conf *conf;
>> +	int err;
>> +
>> +	err = mddev_lock(mddev);
>> +	if (err)
>> +		return err;
>> +	conf = mddev->private;
>> +	if (!conf) {
>> +		mddev_unlock(mddev);
>> +		return -ENODEV;
>> +	}
>> +
>> +	if (strncmp(buf, "ppl", 3) == 0 &&
>> +	    !test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		mddev_suspend(mddev);
>> +		err = ppl_init_log(conf);
> 
> I didn't see you check if all rdev have valid ppl setting in ppl_load. Am I
> missing anything? Is it possible some rdevs have ppl set, but some not. In the
> case, we should just bail out.

All rdevs should have valid ppl settings, this is checked in
ppl_validate_rdev(), just before ppl_load() is called.

> Also check if this is raid5

pers->change_consistency_policy is only set for raid5_personality, so it
is (currently) not necessary.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH] mdadm: add checking the clustered bitmap in assemble mode
From: Zhilong @ 2017-03-01 16:32 UTC (permalink / raw)
  To: Coly Li; +Cc: Jes.Sorensen, linux-raid
In-Reply-To: <66e3dbf7-5851-0895-7de8-baf48375fead@suse.de>



> 在 2017年3月1日，20:03，Coly Li <colyli@suse.de> 写道：
> 
>> On 2017/3/1 下午3:47, Zhilong Liu wrote:
>> 1. both clustered and internal array don't need to specify
>> --bitmap when assembling array.
>> 2. clustered array doesn't support external bitmap mode.
>> 
>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>> 
>> diff --git a/mdadm.c b/mdadm.c
>> index b5d89e4..b08cace 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -1095,8 +1095,14 @@ int main(int argc, char *argv[])
>>                pr_err("bitmap file needed with -b in --assemble mode\n");
>>                exit(2);
>>            }
>> -            if (strcmp(optarg, "internal") == 0) {
>> -                pr_err("there is no need to specify --bitmap when assembling arrays with internal bitmaps\n");
>> +            if (strcmp(optarg, "internal") == 0 ||
>> +                strcmp(optarg, "clustered") == 0) {
>> +                pr_err("no need to specify --bitmap when assembling arrays with internal or clustered bitmaps\n");
> 
> Is it important for 80 characters width limitation ?
> 
To keep good style, the suggestion is great.
>> +                continue;
> 
> Here if optarg is "clustered", it will continue and won't check the
> following strcmp.
> 
>> +            }
>> +            if (strcmp(optarg, "clustered") == 0 &&
>> +                strchr(optarg, '/') != NULL) {
> 
> I guess this check won't happen at all. Correct me if I am wrong.
Yes, I should do this checking in 'create' and 'grow' mode. The first checking is enough in assemble mode.
> 
>> +                pr_err("clustered array doesn't support external bitmap\n");
>>                continue;
>>            }
>>            bitmap_fd = open(optarg, O_RDWR);
>> 
> 
> Thanks.
> 
> Coly
Many many thanks,
Zhilong

send from iPhone

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: 王金浦 @ 2017-03-01 17:01 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Coly Li, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20170228194232.ubkw6s3bxnl6nfyt@kernel.org>

2017-02-28 20:42 GMT+01:00 Shaohua Li <shli@kernel.org>:
> On Fri, Feb 24, 2017 at 11:19:05AM +0100, 王金浦 wrote:
>> Hi Coly, Hi Shaohua,
>>
>>
>> >
>> > Hi Shaohua,
>> >
>> > I try to catch up with you, let me try to follow your mind by the
>> > split-in-while-loop condition (this is my new I/O barrier patch). I
>> > assume the original BIO is a write bio, and original bio is split and
>> > handled in a while loop in raid1_make_request().
>>
>> It's still possible for read bio. We hit a deadlock in the past.
>> See https://patchwork.kernel.org/patch/9498949/
>>
>> Also:
>> http://www.spinics.net/lists/raid/msg52792.html
>
> Thanks Jinpu. So this is for the read side, where we don't have plug stuff.
> Yep, that finally makes sense. It's my fault I didn't look at the read side
> code carefully. Looks we need the patch Neil suggested.
>
> Thanks,
> Shaohua

Thanks Shaohua, I will test your patch and report result.

Regards,
Jinpu

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Phil Turmel @ 2017-03-01 17:23 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid
In-Reply-To: <58B6E287.9020804@youngman.org.uk>

On 03/01/2017 10:02 AM, Wols Lists wrote:

> Sorry to drag this up again, but where are these write intent bits
> going to come from? And it's a backup. Why am I going to re-add,
> unless I'm going to wipe the old backup and create a new one?

Given your process below, it's moot.

> And your way is more complicated - more room for sys-admin cock-up
> :-)

I strongly disagree.  This procedure, as shown, is an admin cock-up:

> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
> ... hotplug sd-big ...
> madam /dev/mdbackup --add /dev/sd-big
> ... wait for sync to finish ...
> mdadm --stop mdbackup
> ... unplug sd-big ...

Are you unmounting /dev/md/home while this is going on?  If not, and
there's any significant activity, your "backup" is corrupt.  If you are
unmounting, your data is unavailable for the duration of the resync.

The corresponding procedure for logical volume in LVM would be:

# lvcreate -n homesnaplv -s homelv --size 10g vg0
# dd if=/dev/vg0/homesnaplv of=/dev/sd-big bs=1M
# lvremove /dev/vg0/homesnaplv

Unlike your solution, the LVM snapshot won't be changing underneath you
during the copy.  The allocated size of the snapshot, shown as 10g
above, only has to be big enough to accommodate the amount of writes to
homelv while the dd is in progress.

Also, LVM understands most mounted filesystems, and will invoke the
proper kernel calls to briefly quiesce the filesystem for the snapshot,
ensuring the filesystem copied out is consistent.  But the user sees
only a few tens or hundreds of milliseconds of hesitation and can keep
going.  Writes to homelv while the snapshot exists generate extra disk
activity (to move the replaced blocks to the snapshot storage, with some
metadata), but is otherwise invisible to the users.

LVM is *made* for this.  You should use it.

Phil

^ permalink raw reply

* Re: [PATCH v4 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: Shaohua Li @ 2017-03-01 17:51 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <d1021e37-f40d-955b-0726-1b64cc4f8064@intel.com>

On Wed, Mar 01, 2017 at 05:13:38PM +0100, Artur Paszkiewicz wrote:
> On 02/28/2017 01:04 AM, Shaohua Li wrote:
> > On Tue, Feb 21, 2017 at 08:43:57PM +0100, Artur Paszkiewicz wrote:
> >> This implements the PPL write logging functionality. The description
> >> of PPL is added to the documentation. More details can be found in the
> >> comments in raid5-ppl.c.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Warn about using PPL with enabled disk volatile write-back cache for
> >> now. It can be removed once disk cache flushing before writing PPL is
> >> implemented.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/md/raid5-ppl.txt |  44 +++
> >>  drivers/md/Makefile            |   2 +-
> >>  drivers/md/raid5-ppl.c         | 617 +++++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c             |  49 +++-
> >>  drivers/md/raid5.h             |   8 +
> >>  include/uapi/linux/raid/md_p.h |  26 ++
> >>  6 files changed, 738 insertions(+), 8 deletions(-)
> >>  create mode 100644 Documentation/md/raid5-ppl.txt
> >>  create mode 100644 drivers/md/raid5-ppl.c
> >>
> >> diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
> >> new file mode 100644
> >> index 000000000000..127072b09363
> >> --- /dev/null
> >> +++ b/Documentation/md/raid5-ppl.txt
> >> @@ -0,0 +1,44 @@
> >> +Partial Parity Log
> >> +
> >> +
> >> +		while (stripes_count--) {
> >> +			/*
> >> +			 * if entry without partial parity just skip its stripes
> >> +			 * without adding pages to bio
> >> +			 */
> >> +			if (pp_size > 0 &&
> >> +			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
> >> +				struct bio *prev = bio;
> >> +
> >> +				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
> >> +						       ppl_conf->bs);
> > 
> > The code keeps allocating bios but doesn't dispatch any yet. The bioset can't
> > guarantee the allocation successes. To have the guarantee, we must make sure
> > previous bios could finish.
> 
> Is it ok to submit the bio before allocating another from the same
> bioset, without waiting for it to complete? The description for
> bio_alloc_bioset() suggests this, if I understand it correctly: "Callers
> that need to allocate more than 1 bio must always submit the previously
> allocated bio for IO before attempting to allocate a new one." 

Yes, I mean submit.
  
> >>  		return -EINVAL;
> >>  	if (mddev->delta_disks == 0 &&
> >>  	    mddev->new_layout == mddev->layout &&
> >> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> >> index 3cc4cb28f7e6..f915a7a0e752 100644
> >> --- a/drivers/md/raid5.h
> >> +++ b/drivers/md/raid5.h
> >> @@ -230,6 +230,7 @@ struct stripe_head {
> >>  	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
> >>  
> >>  	struct page		*ppl_page; /* partial parity of this stripe */
> >> +	struct ppl_io_unit	*ppl_log_io;
> > 
> > Maybe we should put the ppl fields and raid5 fields to a union
> 
> Good idea, will do.
> 
> >>  	/**
> >>  	 * struct stripe_operations
> >>  	 * @target - STRIPE_OP_COMPUTE_BLK target
> >> @@ -689,6 +690,7 @@ struct r5conf {
> >>  	int			group_cnt;
> >>  	int			worker_cnt_per_group;
> >>  	struct r5l_log		*log;
> >> +	struct ppl_conf		*ppl;
> > 
> > Maybe use void * log_private, so works for both raid5-cache and ppl
> 
> Do you mean replace "struct ppl_conf *ppl" with "void *log_private"?

Right. But if this makes checking the log type hard, ignore it.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v4 5/7] raid5-ppl: load and recover the log
From: Shaohua Li @ 2017-03-01 17:59 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid
In-Reply-To: <a7dd976b-0ff4-3e2d-7f91-d291540d614a@intel.com>

On Wed, Mar 01, 2017 at 05:14:18PM +0100, Artur Paszkiewicz wrote:
> On 02/28/2017 02:12 AM, Shaohua Li wrote:
> > On Tue, Feb 21, 2017 at 08:43:59PM +0100, Artur Paszkiewicz wrote:
> >> Load the log from each disk when starting the array and recover if the
> >> array is dirty.
> >>
> >> The initial empty PPL is written by mdadm. When loading the log we
> >> verify the header checksum and signature. For external metadata arrays
> >> the signature is verified in userspace, so here we read it from the
> >> header, verifying only if it matches on all disks, and use it later when
> >> writing PPL.
> >>
> >> In addition to the header checksum, each header entry also contains a
> >> checksum of its partial parity data. If the header is valid, recovery is
> >> performed for each entry until an invalid entry is found. If the array
> >> is not degraded and recovery using PPL fully succeeds, there is no need
> >> to resync the array because data and parity will be consistent, so in
> >> this case resync will be disabled.
> >>
> >> Due to compatibility with IMSM implementations on other systems, we
> >> can't assume that the recovery data block size is always 4K. Writes
> >> generated by MD raid5 don't have this issue, but when recovering PPL
> >> written in other environments it is possible to have entries with
> >> 512-byte sector granularity. The recovery code takes this into account
> >> and also the logical sector size of the underlying drives.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  drivers/md/raid5-ppl.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c     |   5 +-
> >>  2 files changed, 487 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >> index a00cabf1adf6..a7693353243a 100644
> >> --- a/drivers/md/raid5-ppl.c
> >> +++ b/drivers/md/raid5-ppl.c
> >> @@ -16,6 +16,7 @@
> >>  #include <linux/blkdev.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/crc32c.h>
> >> +#include <linux/async_tx.h>
> >>  #include <linux/raid/md_p.h>
> >>  #include "md.h"
> >>  #include "raid5.h"
> >> @@ -84,6 +85,10 @@ struct ppl_conf {
> >>  	mempool_t *io_pool;
> >>  	struct bio_set *bs;
> >>  	mempool_t *meta_pool;
> >> +
> >> +	/* used only for recovery */
> >> +	int recovered_entries;
> >> +	int mismatch_count;
> >>  };
> >>  
> >>  struct ppl_log {
> >> @@ -450,6 +455,467 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
> >>  		ppl_io_unit_finished(io);
> >>  }
> >>  
> >> +static void ppl_xor(int size, struct page *page1, struct page *page2,
> >> +		    struct page *page_result)
> >> +{
> >> +	struct async_submit_ctl submit;
> >> +	struct dma_async_tx_descriptor *tx;
> >> +	struct page *xor_srcs[] = { page1, page2 };
> >> +
> >> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
> >> +			  NULL, NULL, NULL, NULL);
> >> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> >> +
> >> +	async_tx_quiesce(&tx);
> >> +}
> >> +
> >> +/*
> >> + * PPL recovery strategy: xor partial parity and data from all modified data
> >> + * disks within a stripe and write the result as the new stripe parity. If all
> >> + * stripe data disks are modified (full stripe write), no partial parity is
> >> + * available, so just xor the data disks.
> >> + *
> >> + * Recovery of a PPL entry shall occur only if all modified data disks are
> >> + * available and read from all of them succeeds.
> >> + *
> >> + * A PPL entry applies to a stripe, partial parity size for an entry is at most
> >> + * the size of the chunk. Examples of possible cases for a single entry:
> >> + *
> >> + * case 0: single data disk write:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+           +--------------------+
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> >> + * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * +--------+--------+--------+           +--------------------+
> >> + * pp_size = data_size
> >> + *
> >> + * case 1: more than one data disk write:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+           +--------------------+
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> >> + * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
> >> + * | ------ | ------ | ------ | +----+    | (no change)        |
> >> + * +--------+--------+--------+           +--------------------+
> >> + * pp_size = data_size / modified_data_disks
> >> + *
> >> + * case 2: write to all data disks (also full stripe write):
> >> + *   data0    data1    data2                parity
> >> + * +--------+--------+--------+           +--------------------+
> >> + * | ------ | ------ | ------ |           | (no change)        |
> >> + * | -data- | -data- | -data- | --------> | xor all data       |
> >> + * | ------ | ------ | ------ | --------> | (no change)        |
> >> + * | ------ | ------ | ------ |           | (no change)        |
> >> + * +--------+--------+--------+           +--------------------+
> >> + * pp_size = 0
> >> + *
> >> + * The following cases are possible only in other implementations. The recovery
> >> + * code can handle them, but they are not generated at runtime because they can
> >> + * be reduced to cases 0, 1 and 2:
> >> + *
> >> + * case 3:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
> >> + * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
> >> + * | -data- | -data- | -data- | | -- | -> | xor all data       |
> >> + * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * pp_size = chunk_size
> >> + *
> >> + * case 4:
> >> + *   data0    data1    data2     ppl        parity
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
> >> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> >> + * | ------ | ------ | ------ | | -- | -> | (no change)        |
> >> + * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
> >> + * +--------+--------+--------+ +----+    +--------------------+
> >> + * pp_size = chunk_size
> >> + */
> >> +static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
> >> +			     sector_t ppl_sector)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct mddev *mddev = ppl_conf->mddev;
> >> +	struct r5conf *conf = mddev->private;
> >> +	int block_size = ppl_conf->block_size;
> >> +	struct page *pages;
> >> +	struct page *page1;
> >> +	struct page *page2;
> >> +	sector_t r_sector_first;
> >> +	sector_t r_sector_last;
> >> +	int strip_sectors;
> >> +	int data_disks;
> >> +	int i;
> >> +	int ret = 0;
> >> +	char b[BDEVNAME_SIZE];
> >> +	unsigned int pp_size = le32_to_cpu(e->pp_size);
> >> +	unsigned int data_size = le32_to_cpu(e->data_size);
> >> +
> >> +	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
> >> +
> >> +	if ((pp_size >> 9) < conf->chunk_sectors) {
> >> +		if (pp_size > 0) {
> >> +			data_disks = data_size / pp_size;
> >> +			strip_sectors = pp_size >> 9;
> >> +		} else {
> >> +			data_disks = conf->raid_disks - conf->max_degraded;
> >> +			strip_sectors = (data_size >> 9) / data_disks;
> >> +		}
> >> +		r_sector_last = r_sector_first +
> >> +				(data_disks - 1) * conf->chunk_sectors +
> >> +				strip_sectors;
> >> +	} else {
> >> +		data_disks = conf->raid_disks - conf->max_degraded;
> >> +		strip_sectors = conf->chunk_sectors;
> >> +		r_sector_last = r_sector_first + (data_size >> 9);
> >> +	}
> >> +
> >> +	pages = alloc_pages(GFP_KERNEL, 1);
> > 
> > order != 0 allocation should be avoid if possible. Here sounds not necessary
> 
> OK, I will allocate and free them separately.
> 
> >> +	if (!pages)
> >> +		return -ENOMEM;
> >> +	page1 = pages;
> >> +	page2 = pages + 1;
> >> +
> >> +	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
> >> +		 (unsigned long long)r_sector_first,
> >> +		 (unsigned long long)r_sector_last);
> >> +
> >> +	/* if start and end is 4k aligned, use a 4k block */
> >> +	if (block_size == 512 &&
> >> +	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
> >> +	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
> >> +		block_size = STRIPE_SIZE;
> >> +
> >> +	/* iterate through blocks in strip */
> >> +	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
> >> +		bool update_parity = false;
> >> +		sector_t parity_sector;
> >> +		struct md_rdev *parity_rdev;
> >> +		struct stripe_head sh;
> >> +		int disk;
> >> +		int indent = 0;
> >> +
> >> +		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
> >> +		indent += 2;
> >> +
> >> +		memset(page_address(page1), 0, PAGE_SIZE);
> >> +
> >> +		/* iterate through data member disks */
> >> +		for (disk = 0; disk < data_disks; disk++) {
> >> +			int dd_idx;
> >> +			struct md_rdev *rdev;
> >> +			sector_t sector;
> >> +			sector_t r_sector = r_sector_first + i +
> >> +					    (disk * conf->chunk_sectors);
> >> +
> >> +			pr_debug("%s:%*s data member disk %d start\n",
> >> +				 __func__, indent, "", disk);
> >> +			indent += 2;
> >> +
> >> +			if (r_sector >= r_sector_last) {
> >> +				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
> >> +					 __func__, indent, "",
> >> +					 (unsigned long long)r_sector);
> >> +				indent -= 2;
> >> +				continue;
> >> +			}
> >> +
> >> +			update_parity = true;
> >> +
> >> +			/* map raid sector to member disk */
> >> +			sector = raid5_compute_sector(conf, r_sector, 0,
> >> +						      &dd_idx, NULL);
> >> +			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
> >> +				 __func__, indent, "",
> >> +				 (unsigned long long)r_sector, dd_idx,
> >> +				 (unsigned long long)sector);
> >> +
> >> +			rdev = conf->disks[dd_idx].rdev;
> >> +			if (!rdev) {
> >> +				pr_debug("%s:%*s data member disk %d missing\n",
> >> +					 __func__, indent, "", dd_idx);
> >> +				update_parity = false;
> >> +				break;
> >> +			}
> >> +
> >> +			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
> >> +				 __func__, indent, "", bdevname(rdev->bdev, b),
> >> +				 (unsigned long long)sector);
> >> +			if (!sync_page_io(rdev, sector, block_size, page2,
> >> +					REQ_OP_READ, 0, false)) {
> >> +				md_error(mddev, rdev);
> >> +				pr_debug("%s:%*s read failed!\n", __func__,
> >> +					 indent, "");
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			ppl_xor(block_size, page1, page2, page1);
> >> +
> >> +			indent -= 2;
> >> +		}
> >> +
> >> +		if (!update_parity)
> >> +			continue;
> >> +
> >> +		if (pp_size > 0) {
> >> +			pr_debug("%s:%*s reading pp disk sector %llu\n",
> >> +				 __func__, indent, "",
> >> +				 (unsigned long long)(ppl_sector + i));
> >> +			if (!sync_page_io(log->rdev,
> >> +					ppl_sector - log->rdev->data_offset + i,
> >> +					block_size, page2, REQ_OP_READ, 0,
> >> +					false)) {
> >> +				pr_debug("%s:%*s read failed!\n", __func__,
> >> +					 indent, "");
> >> +				md_error(mddev, log->rdev);
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			ppl_xor(block_size, page1, page2, page1);
> >> +		}
> >> +
> >> +		/* map raid sector to parity disk */
> >> +		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
> >> +				0, &disk, &sh);
> >> +		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
> >> +		parity_rdev = conf->disks[sh.pd_idx].rdev;
> >> +
> >> +		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
> >> +		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
> >> +			 __func__, indent, "",
> >> +			 (unsigned long long)parity_sector,
> >> +			 bdevname(parity_rdev->bdev, b));
> >> +		if (!sync_page_io(parity_rdev, parity_sector, block_size,
> >> +				page1, REQ_OP_WRITE, 0, false)) {
> >> +			pr_debug("%s:%*s parity write error!\n", __func__,
> >> +				 indent, "");
> >> +			md_error(mddev, parity_rdev);
> >> +			ret = -EIO;
> >> +			goto out;
> >> +		}
> >> +	}
> >> +out:
> >> +	__free_pages(pages, 1);
> >> +	return ret;
> >> +}
> >> +
> >> +static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	struct mddev *mddev = rdev->mddev;
> >> +	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
> >> +	struct page *page;
> >> +	int i;
> >> +	int ret = 0;
> >> +
> >> +	page = alloc_page(GFP_KERNEL);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	/* iterate through all PPL entries saved */
> >> +	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
> >> +		struct ppl_header_entry *e = &pplhdr->entries[i];
> >> +		u32 pp_size = le32_to_cpu(e->pp_size);
> >> +		sector_t sector = ppl_sector;
> >> +		int ppl_entry_sectors = pp_size >> 9;
> >> +		u32 crc, crc_stored;
> >> +
> >> +		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
> >> +			 __func__, rdev->raid_disk, i,
> >> +			 (unsigned long long)ppl_sector, pp_size);
> >> +
> >> +		crc = ~0;
> >> +		crc_stored = le32_to_cpu(e->checksum);
> >> +
> >> +		/* read parial parity for this entry and calculate its checksum */
> >> +		while (pp_size) {
> >> +			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
> >> +
> >> +			if (!sync_page_io(rdev, sector - rdev->data_offset,
> >> +					s, page, REQ_OP_READ, 0, false)) {
> >> +				md_error(mddev, rdev);
> >> +				ret = -EIO;
> >> +				goto out;
> >> +			}
> >> +
> >> +			crc = crc32c_le(crc, page_address(page), s);
> >> +
> >> +			pp_size -= s;
> >> +			sector += s >> 9;
> >> +		}
> >> +
> >> +		crc = ~crc;
> >> +
> >> +		if (crc != crc_stored) {
> >> +			/*
> >> +			 * Don't recover this entry if the checksum does not
> >> +			 * match, but keep going and try to recover other
> >> +			 * entries.
> >> +			 */
> >> +			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
> >> +				 __func__, crc_stored, crc);
> >> +			ppl_conf->mismatch_count++;
> >> +		} else {
> >> +			ret = ppl_recover_entry(log, e, ppl_sector);
> >> +			if (ret)
> >> +				goto out;
> >> +			ppl_conf->recovered_entries++;
> >> +		}
> >> +
> >> +		ppl_sector += ppl_entry_sectors;
> >> +	}
> >> +out:
> >> +	__free_page(page);
> >> +	return ret;
> >> +}
> >> +
> >> +static int ppl_write_empty_header(struct ppl_log *log)
> >> +{
> >> +	struct page *page;
> >> +	struct ppl_header *pplhdr;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	int ret = 0;
> >> +
> >> +	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
> >> +		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
> >> +
> >> +	page = alloc_page(GFP_NOIO | __GFP_ZERO);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	pplhdr = page_address(page);
> >> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> >> +	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
> >> +	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
> >> +
> >> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> >> +			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
> > 
> > write with FUA? otherwise, the empty header might not set down in the disks.
> 
> OK.
> 
> >> +		md_error(rdev->mddev, rdev);
> >> +		ret = -EIO;
> >> +	}
> >> +
> >> +	__free_page(page);
> >> +	return ret;
> >> +}
> >> +
> >> +static int ppl_load_distributed(struct ppl_log *log)
> >> +{
> >> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >> +	struct md_rdev *rdev = log->rdev;
> >> +	struct mddev *mddev = rdev->mddev;
> >> +	struct page *page;
> >> +	struct ppl_header *pplhdr;
> >> +	u32 crc, crc_stored;
> >> +	u32 signature;
> >> +	int ret = 0;
> >> +
> >> +	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
> >> +
> >> +	/* read PPL header */
> >> +	page = alloc_page(GFP_KERNEL);
> >> +	if (!page)
> >> +		return -ENOMEM;
> >> +
> >> +	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
> >> +			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
> >> +		md_error(mddev, rdev);
> >> +		ret = -EIO;
> >> +		goto out;
> >> +	}
> >> +	pplhdr = page_address(page);
> >> +
> >> +	/* check header validity */
> >> +	crc_stored = le32_to_cpu(pplhdr->checksum);
> >> +	pplhdr->checksum = 0;
> >> +	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
> >> +
> >> +	if (crc_stored != crc) {
> >> +		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
> >> +			 __func__, crc_stored, crc);
> >> +		ppl_conf->mismatch_count++;
> >> +		goto out;
> >> +	}
> >> +
> >> +	signature = le32_to_cpu(pplhdr->signature);
> >> +
> >> +	if (mddev->external) {
> >> +		/*
> >> +		 * For external metadata the header signature is set and
> >> +		 * validated in userspace.
> >> +		 */
> >> +		ppl_conf->signature = signature;
> >> +	} else if (ppl_conf->signature != signature) {
> >> +		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
> >> +			 __func__, signature, ppl_conf->signature);
> >> +		ppl_conf->mismatch_count++;
> >> +		goto out;
> >> +	}
> >> +
> >> +	/* attempt to recover from log if we are starting a dirty array */
> >> +	if (!mddev->pers && mddev->recovery_cp != MaxSector)
> >> +		ret = ppl_recover(log, pplhdr);
> > 
> > when could mddev->pers be null?
> 
> When the array is inactive - we are starting the array with ppl, not
> enabling ppl at runtime.

Didn't follow this. Can explain more?

Thanks,
Shaohua

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-03-01 18:04 UTC (permalink / raw)
  To: Milan Broz
  Cc: Rajendra, Herbert Xu, Oded, Mike Snitzer,
	Linux kernel mailing list, Ondrej Mosnacek, linux-raid,
	Gilad Ben-Yossef, dm-devel, Mark Brown, Arnd Bergmann,
	linux-crypto, Shaohua Li, David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <68f70534-a309-46ba-a84d-8acc1e6620e5@gmail.com>

[-- Attachment #1.1: Type: text/plain, Size: 1616 bytes --]

Hi Milan,

On 1 March 2017 at 02:35, Milan Broz <gmazyland@gmail.com> wrote:

> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >
> > I was wondering if this is near to be ready for submission (apart from
> > the testmgr.c
> > changes) or I need to make some changes to make it similar to the IPSec
> offload?
>
> I just tried this and except it registers the IV for every new device
> again, it works...
> (After a while you have many duplicate entries in /proc/crypto.)

It is because the the crypto lookup api sees that the crypto algorithm is in
a LARVAL state and registers a new instance every time by invoking the
".create" callback. I guess it should be solved by adding test data to
testmgr.

Do you have some real performance numbers that proves that such a patch is
> adequate?
>

While waiting to do some implementation of the hw crypto drivers to work
with
dm-crypt, I'll also generate some numbers to compare the performance with
the
original dm-crypt code with the new one with a software implementation in
place.

> I would really like to see the performance issue fixed but I am really not
> sure
> this approach works for everyone. It would be better to avoid repeating
> this exercise later.
> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> to speedup things even for crypt drivers that do not support own IV
> generators.
>

I think it should work for everyone (even for ciphers not supporting IVs)
if the null IV
mode is used. It should be upto the IV generation template to choose to
generate IV
or just call the underlying (base) template/cipher.

Regards,
Binoy

[-- Attachment #1.2: Type: text/html, Size: 2501 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]

^ permalink raw reply

* Re: [BUG] non-metadata arrays cannot use more than 27 component devices
From: Phil Turmel @ 2017-03-01 18:13 UTC (permalink / raw)
  To: Wols Lists, ian_bruce, linux-raid
In-Reply-To: <27373f29-a8ea-012c-102d-9fadf05cb3c2@turmel.org>

On 03/01/2017 12:23 PM, Phil Turmel wrote:
> I strongly disagree.  This procedure, as shown, is an admin cock-up:
> 
>> mdadm --build /dev/mdbackup --device-count 2 /dev/md/home missing
>> ... hotplug sd-big ...
>> madam /dev/mdbackup --add /dev/sd-big
>> ... wait for sync to finish ...
>> mdadm --stop mdbackup
>> ... unplug sd-big ...

One more point.  The above is functionally identical in every respect
to just:

# dd if=/dev/md/home of=/dev/sd-big bs=1M

Why are you bothering to --build an array?

Phil


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox