* [PATCH 2/4] md-cluster: Suspend writes in RAID10 if within range
2017-10-24 7:11 [PATCH 0/3] Add basic support for clustered raid10 Guoqing Jiang
2017-10-24 7:11 ` [PATCH 1/4] md-cluster/raid10: set "do_balance = 0" if area is resyncing Guoqing Jiang
@ 2017-10-24 7:11 ` Guoqing Jiang
2017-10-24 7:11 ` [PATCH 3/4] md-cluster: Use a small window for raid10 resync Guoqing Jiang
2017-10-24 7:11 ` [PATCH 4/4] md-cluster: update document for raid10 Guoqing Jiang
3 siblings, 0 replies; 7+ messages in thread
From: Guoqing Jiang @ 2017-10-24 7:11 UTC (permalink / raw)
To: shli; +Cc: linux-raid, neilb
If there is a resync going on, all nodes must suspend
writes to the range. This is recorded in suspend_info
and suspend_list.
If there is an I/O within the ranges of any of the
suspend_info, area_resyncing will return 1.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/raid10.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 61890231972e..5a32f6749d1c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -25,6 +25,7 @@
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
#include <linux/kthread.h>
+#include <linux/sched/signal.h>
#include <trace/events/block.h>
#include "md.h"
#include "raid10.h"
@@ -1294,6 +1295,22 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
sector_t sectors;
int max_sectors;
+ if ((mddev_is_clustered(mddev) &&
+ md_cluster_ops->area_resyncing(mddev, WRITE,
+ bio->bi_iter.bi_sector,
+ bio_end_sector(bio)))) {
+ DEFINE_WAIT(w);
+ for (;;) {
+ prepare_to_wait(&conf->wait_barrier,
+ &w, TASK_IDLE);
+ if (!md_cluster_ops->area_resyncing(mddev, WRITE,
+ bio->bi_iter.bi_sector, bio_end_sector(bio)))
+ break;
+ schedule();
+ }
+ finish_wait(&conf->wait_barrier, &w);
+ }
+
/*
* Register the new request and wait if the reconstruction
* thread has put up a bar for new requests.
--
2.10.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 3/4] md-cluster: Use a small window for raid10 resync
2017-10-24 7:11 [PATCH 0/3] Add basic support for clustered raid10 Guoqing Jiang
2017-10-24 7:11 ` [PATCH 1/4] md-cluster/raid10: set "do_balance = 0" if area is resyncing Guoqing Jiang
2017-10-24 7:11 ` [PATCH 2/4] md-cluster: Suspend writes in RAID10 if within range Guoqing Jiang
@ 2017-10-24 7:11 ` Guoqing Jiang
2017-10-24 7:11 ` [PATCH 4/4] md-cluster: update document for raid10 Guoqing Jiang
3 siblings, 0 replies; 7+ messages in thread
From: Guoqing Jiang @ 2017-10-24 7:11 UTC (permalink / raw)
To: shli; +Cc: linux-raid, neilb
Suspending the entire device for resync could take
too long. Resync in small chunks.
cluster's resync window is maintained in r10conf as
cluster_sync_low and cluster_sync_high, and processed
in raid10's sync_request(). If the current resync is
outside the cluster resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Set cluster_sync_high to cluster_sync_low + stripe
size.
3. Send a message to all nodes so they may add it in
their suspension list.
Note:
We only support "near" raid10 so far, resync a far or
offset raid10 array could have trouble. So raid10_run
checks the layout of clustered raid10, it will refuse
to run if the layout is not correct.
With the "near" layout we process one stripe at a time
progressing monotonically through the address space.
So we can have a sliding window of whole-stripes which
moves through the array suspending IO on other nodes,
and both resync which uses array addresses and recovery
which uses device addresses can stay within this window.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
drivers/md/raid10.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++-
drivers/md/raid10.h | 6 +++
2 files changed, 118 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5a32f6749d1c..557e59bcd535 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -137,10 +137,13 @@ static void r10bio_pool_free(void *r10_bio, void *data)
kfree(r10_bio);
}
+#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
/* amount of memory to reserve for resync requests */
#define RESYNC_WINDOW (1024*1024)
/* maximum number of concurrent requests, memory permitting */
#define RESYNC_DEPTH (32*1024*1024/RESYNC_BLOCK_SIZE)
+#define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
+#define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
/*
* When performing a resync, we need to read and compare, so
@@ -2842,6 +2845,43 @@ static struct r10bio *raid10_alloc_init_r10buf(struct r10conf *conf)
}
/*
+ * Set cluster_sync_high since we need other nodes to add the
+ * range [cluster_sync_low, cluster_sync_high] to suspend list.
+ */
+static void raid10_set_cluster_sync_high(struct r10conf *conf)
+{
+ sector_t window_size;
+ int extra_chunk, chunks;
+
+ /*
+ * First, here we define "stripe" as a unit which across
+ * all member devices one time, so we get chunks by use
+ * raid_disks / near_copies. Otherwise, if near_copies is
+ * close to raid_disks, then resync window could increases
+ * linearly with the increase of raid_disks, which means
+ * we will suspend a really large IO window while it is not
+ * necessary. If raid_disks is not divisible by near_copies,
+ * an extra chunk is needed to ensure the whole "stripe" is
+ * covered.
+ */
+
+ chunks = conf->geo.raid_disks / conf->geo.near_copies;
+ if (conf->geo.raid_disks % conf->geo.near_copies == 0)
+ extra_chunk = 0;
+ else
+ extra_chunk = 1;
+ window_size = (chunks + extra_chunk) * conf->mddev->chunk_sectors;
+
+ /*
+ * At least use a 32M window to align with raid1's resync window
+ */
+ window_size = (CLUSTER_RESYNC_WINDOW_SECTORS > window_size) ?
+ CLUSTER_RESYNC_WINDOW_SECTORS : window_size;
+
+ conf->cluster_sync_high = conf->cluster_sync_low + window_size;
+}
+
+/*
* perform a "sync" on one "block"
*
* We need to make sure that no normal I/O request - particularly write
@@ -2913,6 +2953,9 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
max_sector = mddev->resync_max_sectors;
if (sector_nr >= max_sector) {
+ conf->cluster_sync_low = 0;
+ conf->cluster_sync_high = 0;
+
/* If we aborted, we need to abort the
* sync on the 'current' bitmap chucks (there can
* be several when recovering multiple devices).
@@ -3267,7 +3310,17 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
/* resync. Schedule a read for every block at this virt offset */
int count = 0;
- bitmap_cond_end_sync(mddev->bitmap, sector_nr, 0);
+ /*
+ * Since curr_resync_completed could probably not update in
+ * time, and we will set cluster_sync_low based on it.
+ * Let's check against "sector_nr + 2 * RESYNC_SECTORS" for
+ * safety reason, which ensures curr_resync_completed is
+ * updated in bitmap_cond_end_sync.
+ */
+ bitmap_cond_end_sync(mddev->bitmap, sector_nr,
+ mddev_is_clustered(mddev) &&
+ (sector_nr + 2 * RESYNC_SECTORS >
+ conf->cluster_sync_high));
if (!bitmap_start_sync(mddev->bitmap, sector_nr,
&sync_blocks, mddev->degraded) &&
@@ -3401,6 +3454,52 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
} while (++page_idx < RESYNC_PAGES);
r10_bio->sectors = nr_sectors;
+ if (mddev_is_clustered(mddev) &&
+ test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+ /* It is resync not recovery */
+ if (conf->cluster_sync_high < sector_nr + nr_sectors) {
+ conf->cluster_sync_low = mddev->curr_resync_completed;
+ raid10_set_cluster_sync_high(conf);
+ /* Send resync message */
+ md_cluster_ops->resync_info_update(mddev,
+ conf->cluster_sync_low,
+ conf->cluster_sync_high);
+ }
+ } else if (mddev_is_clustered(mddev)) {
+ /* This is recovery not resync */
+ sector_t sect_va1, sect_va2;
+ bool broadcast_msg = false;
+
+ for (i = 0; i < conf->geo.raid_disks; i++) {
+ /*
+ * sector_nr is a device address for recovery, so we
+ * need translate it to array address before compare
+ * with cluster_sync_high.
+ */
+ sect_va1 = raid10_find_virt(conf, sector_nr, i);
+
+ if (conf->cluster_sync_high < sect_va1 + nr_sectors) {
+ broadcast_msg = true;
+ /*
+ * curr_resync_completed is similar as
+ * sector_nr, so make the translation too.
+ */
+ sect_va2 = raid10_find_virt(conf,
+ mddev->curr_resync_completed, i);
+
+ if (conf->cluster_sync_low == 0 ||
+ conf->cluster_sync_low > sect_va2)
+ conf->cluster_sync_low = sect_va2;
+ }
+ }
+ if (broadcast_msg) {
+ raid10_set_cluster_sync_high(conf);
+ md_cluster_ops->resync_info_update(mddev,
+ conf->cluster_sync_low,
+ conf->cluster_sync_high);
+ }
+ }
+
while (biolist) {
bio = biolist;
biolist = biolist->bi_next;
@@ -3660,6 +3759,18 @@ static int raid10_run(struct mddev *mddev)
if (!conf)
goto out;
+ if (mddev_is_clustered(conf->mddev)) {
+ int fc, fo;
+
+ fc = (mddev->layout >> 8) & 255;
+ fo = mddev->layout & (1<<16);
+ if (fc > 1 || fo > 0) {
+ pr_err("only near layout is supported by clustered"
+ " raid10\n");
+ goto out;
+ }
+ }
+
mddev->thread = conf->thread;
conf->thread = NULL;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 735ce1a3d260..2bef4e8789c8 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -88,6 +88,12 @@ struct r10conf {
* the new thread here until we fully activate the array.
*/
struct md_thread *thread;
+
+ /*
+ * Keep track of cluster resync window to send to other nodes.
+ */
+ sector_t cluster_sync_low;
+ sector_t cluster_sync_high;
};
/*
--
2.10.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH 4/4] md-cluster: update document for raid10
2017-10-24 7:11 [PATCH 0/3] Add basic support for clustered raid10 Guoqing Jiang
` (2 preceding siblings ...)
2017-10-24 7:11 ` [PATCH 3/4] md-cluster: Use a small window for raid10 resync Guoqing Jiang
@ 2017-10-24 7:11 ` Guoqing Jiang
2017-10-30 5:44 ` Shaohua Li
3 siblings, 1 reply; 7+ messages in thread
From: Guoqing Jiang @ 2017-10-24 7:11 UTC (permalink / raw)
To: shli; +Cc: linux-raid, neilb
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
Documentation/md/md-cluster.txt | 3 ++-
drivers/md/Kconfig | 5 +++--
drivers/md/md-cluster.c | 2 +-
3 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
index 82ee51604e9a..e1055f105cf5 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -1,4 +1,5 @@
-The cluster MD is a shared-device RAID for a cluster.
+The cluster MD is a shared-device RAID for a cluster, it supports
+two levels: raid1 and raid10 (limited support).
1. On-disk format
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 4a249ee86364..83b9362be09c 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -178,7 +178,7 @@ config MD_FAULTY
config MD_CLUSTER
- tristate "Cluster Support for MD (EXPERIMENTAL)"
+ tristate "Cluster Support for MD"
depends on BLK_DEV_MD
depends on DLM
default n
@@ -188,7 +188,8 @@ config MD_CLUSTER
nodes in the cluster can access the MD devices simultaneously.
This brings the redundancy (and uptime) of RAID levels across the
- nodes of the cluster.
+ nodes of the cluster. Currently, it can work with raid1 and raid10
+ (limited support).
If unsure, say N.
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index d0fd1bd8575c..79bfbc840385 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1478,7 +1478,7 @@ static struct md_cluster_operations cluster_ops = {
static int __init cluster_init(void)
{
- pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n");
+ pr_warn("md-cluster: support raid1 and raid10 (limited support)\n");
pr_info("Registering Cluster MD functions\n");
register_md_cluster_operations(&cluster_ops, THIS_MODULE);
return 0;
--
2.10.0
^ permalink raw reply related [flat|nested] 7+ messages in thread