From: Guoqing Jiang <gqjiang@suse.com>
Subject: Re: [PATCH 14/14] md-cluster: add the support for resize
Date: Wed, 1 Mar 2017 11:28:46 +0800
Message-ID: <58B63FEE.5010709@suse.com>
References: <1487906124-20107-1-git-send-email-gqjiang@suse.com>
 <1487906124-20107-15-git-send-email-gqjiang@suse.com>
 <20170228192536.r74i2o3vchf74isn@kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <linux-raid-owner@vger.kernel.org>
In-Reply-To: <20170228192536.r74i2o3vchf74isn@kernel.org>
Sender: linux-raid-owner@vger.kernel.org
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org, shli@fb.com, neilb@suse.de
List-Id: linux-raid.ids


On 03/01/2017 03:25 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:24AM +0800, Guoqing Jiang wrote:
>> To update size for cluster raid, we need to make
>> sure all nodes can perform the change successfully.
>> However, it is possible that some of them can't do
>> it due to failure (bitmap_resize could fail). So
>> we need to consider the issue before we set the
>> capacity unconditionally, and we use the steps below
>> to perform a sanity check.
>>
>> 1. A changes the size, then broadcasts a METADATA_UPDATED
>>     msg.
>> 2. B and C receive METADATA_UPDATED and change the size,
>>     but do not call set_capacity; sync_size is not updated
>>     if the change failed. They also call bitmap_update_sb
>>     to sync the sb to disk.
>> 3. A checks the other nodes' sync_size; if sync_size has
>>     been updated on all nodes, it sends a CHANGE_CAPACITY
>>     msg, otherwise it sends a msg to revert the previous change.
>> 4. B and C call set_capacity if they receive the CHANGE_CAPACITY
>>     msg; otherwise pers->resize is called to restore
>>     the old value.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>   Documentation/md/md-cluster.txt |  2 +-
>>   drivers/md/md-cluster.c         | 75 +++++++++++++++++++++++++++++++++++++++++
>>   drivers/md/md-cluster.h         |  1 +
>>   drivers/md/md.c                 | 21 +++++++++---
>>   4 files changed, 93 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
>> index 38883276d31c..2663d49dd8a0 100644
>> --- a/Documentation/md/md-cluster.txt
>> +++ b/Documentation/md/md-cluster.txt
>> @@ -321,4 +321,4 @@ The algorithm is:
>>   
>>   There are somethings which are not supported by cluster MD yet.
>>   
>> -- update size and change array_sectors.
>> +- change array_sectors.
>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>> index d3c024e6bfcf..75da83187c31 100644
>> --- a/drivers/md/md-cluster.c
>> +++ b/drivers/md/md-cluster.c
>> @@ -1147,6 +1147,80 @@ int cluster_check_sync_size(struct mddev *mddev)
>>   	return (my_sync_size == sync_size) ? 0 : -1;
>>   }
>>   
>> +/*
>> + * Updating the size for cluster raid is a little more complex; we perform
>> + * it in the following steps:
>> + * 1. hold the token lock and update the superblock on the initiator node.
>> + * 2. send a METADATA_UPDATED msg to the other nodes.
>> + * 3. The initiator node then checks each bitmap's sync_size; if all
>> + *    bitmaps have the same sync_size, we can set the capacity and let
>> + *    the other nodes do the same. If one node can't update sync_size
>> + *    accordingly, we need to revert to the previous value.
>> + */
>> +static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
>> +{
>> +	struct md_cluster_info *cinfo = mddev->cluster_info;
>> +	struct cluster_msg cmsg;
>> +	struct md_rdev *rdev;
>> +	int ret = 0;
>> +	int raid_slot = -1;
>> +
>> +	md_update_sb(mddev, 1);
>> +	lock_comm(cinfo, 1);
>> +
>> +	memset(&cmsg, 0, sizeof(cmsg));
>> +	cmsg.type = cpu_to_le32(METADATA_UPDATED);
>> +	rdev_for_each(rdev, mddev)
>> +		if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
>> +			raid_slot = rdev->desc_nr;
>> +			break;
>> +		}
>> +	if (raid_slot >= 0) {
>> +		cmsg.raid_slot = cpu_to_le32(raid_slot);
>> +		/*
>> +		 * We can only change capacity after all the nodes can do it,
>> +		 * so we need to wait until the other nodes have received the
>> +		 * msg and handled the change
>> +		 */
>> +		ret = __sendmsg(cinfo, &cmsg);
>> +		if (ret) {
>> +			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
>> +			       __func__, __LINE__);
>> +			unlock_comm(cinfo);
>> +			return;
>> +		}
>> +	} else {
>> +		pr_err("md-cluster: No good device id found to send\n");
>> +		unlock_comm(cinfo);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * check the sync_size from the other nodes' bitmaps; if sync_size
>> +	 * has already been updated on the other nodes as expected, send an
>> +	 * empty metadata msg to permit the change of capacity
>> +	 */
>> +	if (cluster_check_sync_size(mddev) == 0) {
>> +		memset(&cmsg, 0, sizeof(cmsg));
>> +		cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
>> +		ret = __sendmsg(cinfo, &cmsg);
>> +		if (ret)
>> +			pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
>> +			       __func__, __LINE__);
>> +		set_capacity(mddev->gendisk, mddev->array_sectors);
> Don't we need to call revalidate_disk here? And why not move the gendisk-related stuff to md.c?

Thanks, I will add revalidate_disk after set_capacity. But we can't move
it to md.c, since md-cluster handles resize differently.

>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6503,10 +6503,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>>   	struct md_rdev *rdev;
>>   	int rv;
>>   	int fit = (num_sectors == 0);
>> -
>> -	/* cluster raid doesn't support update size */
>> -	if (mddev_is_clustered(mddev))
>> -		return -EINVAL;
>> +	sector_t old_dev_sectors = mddev->dev_sectors;
>>   
>>   	if (mddev->pers->resize == NULL)
>>   		return -EINVAL;
>> @@ -6535,7 +6532,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>>   	}
>>   	rv = mddev->pers->resize(mddev, num_sectors);
>>   	if (!rv) {
>> -		if (mddev->queue) {
>> +		if (mddev_is_clustered(mddev))
>> +			md_cluster_ops->update_size(mddev, old_dev_sectors);
>> +		else if (mddev->queue) {
>>   			set_capacity(mddev->gendisk, mddev->array_sectors);
>>   			revalidate_disk(mddev->gendisk);
>>   		}

As you can see, the path differs between common md and md-cluster: we
have to check whether all the bitmaps have the same sync_size before we
set the capacity and revalidate the disk.
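
To make the difference concrete, here is a minimal userspace sketch of
the check-then-commit flow. It is only a model under stated assumptions,
not the kernel code: struct node, peer_metadata_updated(),
check_sync_size() and cluster_resize() are hypothetical names, and the
cluster messages are modelled as direct function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long sector_t;

/* Hypothetical per-node state; this is NOT the kernel's struct mddev. */
struct node {
	sector_t dev_sectors;	/* size the personality currently uses */
	sector_t sync_size;	/* size recorded in this node's bitmap sb */
	sector_t capacity;	/* size exposed to users (set_capacity) */
	bool resize_fails;	/* simulate a failing bitmap_resize */
};

/* A peer handles METADATA_UPDATED: it resizes but leaves capacity alone. */
static void peer_metadata_updated(struct node *n, sector_t new_size)
{
	if (n->resize_fails)
		return;		/* sync_size keeps the old value */
	n->dev_sectors = new_size;
	n->sync_size = new_size;
}

/* The initiator compares every peer's sync_size with its own. */
static int check_sync_size(const struct node *self,
			   const struct node *peers, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (peers[i].sync_size != self->sync_size)
			return -1;
	return 0;
}

/* Returns 0 if the new size was committed everywhere, -1 if reverted. */
static int cluster_resize(struct node *self, struct node *peers, size_t n,
			  sector_t new_size)
{
	sector_t old = self->dev_sectors;

	/* The initiator resizes first, then "broadcasts" METADATA_UPDATED. */
	self->dev_sectors = new_size;
	self->sync_size = new_size;
	for (size_t i = 0; i < n; i++)
		peer_metadata_updated(&peers[i], new_size);

	if (check_sync_size(self, peers, n) == 0) {
		/* Models CHANGE_CAPACITY: only now does capacity change. */
		self->capacity = new_size;
		for (size_t i = 0; i < n; i++)
			peers[i].capacity = new_size;
		return 0;
	}

	/* Models the revert: every node goes back to the old size. */
	self->dev_sectors = old;
	self->sync_size = old;
	for (size_t i = 0; i < n; i++) {
		peers[i].dev_sectors = old;
		peers[i].sync_size = old;
	}
	return -1;
}
```

The point of the sketch is that capacity is never touched until every
node has reported the new sync_size, which is why the set_capacity call
lives in the md-cluster path rather than in the common md path.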

>> @@ -8753,6 +8752,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>>   	int role, ret;
>>   	char b[BDEVNAME_SIZE];
>>   
>> +	/*
>> +	 * If size is changed in another node then we need to
>> +	 * do resize as well.
>> +	 */
>> +	if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
>> +		ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
>> +		if (ret)
>> +			pr_info("md-cluster: resize failed\n");
>> +		else
>> +			bitmap_update_sb(mddev->bitmap);
>> +	}
> I'm confused: who will trigger this? Patch 10 only calls set_capacity.

The call chain is process_metadata_update -> md_reload_sb ->
check_sb_changes, which means a node that receives a METADATA_UPDATED
msg will go through this path.

And in this patch, update_size is what sends that msg:

+static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+	struct cluster_msg cmsg;
+	struct md_rdev *rdev;
+	int ret = 0;
+	int raid_slot = -1;
+
+	md_update_sb(mddev, 1);
+	lock_comm(cinfo, 1);
+
+	memset(&cmsg, 0, sizeof(cmsg));
+	cmsg.type = cpu_to_le32(METADATA_UPDATED);
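
On the receiving side, the chain above can be modelled roughly as
follows. This is a simplified userspace sketch, not the real
implementation: every toy_* name is hypothetical, the on-disk superblock
size is passed in as a plain argument, and pers->resize() is reduced to
an assignment.

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* Hypothetical stand-ins for the two message types discussed here. */
enum toy_msg { TOY_METADATA_UPDATED, TOY_CHANGE_CAPACITY };

/* Hypothetical receiver state; this is NOT the kernel's struct mddev. */
struct toy_mddev {
	sector_t dev_sectors;	/* size currently in use */
	sector_t sb_size;	/* size read back from the superblock */
	sector_t capacity;	/* size exposed to users */
	int resize_calls;	/* how often the resize path ran */
};

/* Models check_sb_changes: resize only if the sb size differs. */
static void toy_check_sb_changes(struct toy_mddev *m)
{
	if (m->dev_sectors != m->sb_size) {
		m->dev_sectors = m->sb_size;	/* stands in for pers->resize() */
		m->resize_calls++;
	}
}

/* Models md_reload_sb: re-read the superblock, then check for changes. */
static void toy_md_reload_sb(struct toy_mddev *m, sector_t on_disk_size)
{
	m->sb_size = on_disk_size;
	toy_check_sb_changes(m);
}

/* Models process_metadata_update: it just reloads the superblock. */
static void toy_process_metadata_update(struct toy_mddev *m,
					sector_t on_disk_size)
{
	toy_md_reload_sb(m, on_disk_size);
}

/* Hypothetical dispatcher for a received message. */
static void toy_handle_msg(struct toy_mddev *m, enum toy_msg type,
			   sector_t on_disk_size)
{
	switch (type) {
	case TOY_METADATA_UPDATED:
		toy_process_metadata_update(m, on_disk_size);
		break;
	case TOY_CHANGE_CAPACITY:
		/* Simplification: capacity follows the resized dev_sectors. */
		m->capacity = m->dev_sectors;
		break;
	}
}
```

So METADATA_UPDATED alone triggers the resize on the receiver, while the
capacity only changes once CHANGE_CAPACITY arrives.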

> Please describe the details in each node. Also please add it to md-cluster.txt
> document.

Does the following help?

diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
index 2663d49dd8a0..709439e687c2 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -71,8 +71,8 @@ There are three groups of locks for managing the device:

   3.1.1 METADATA_UPDATED: informs other nodes that the metadata has
     been updated, and the node must re-read the md superblock. This is
-   performed synchronously. It is primarily used to signal device
-   failure.
+   performed synchronously. We send this message if sb is updated or
+   the size is changed.

Thanks,
Guoqing