linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/10] mdadm tool: add the support for cluster-md
@ 2015-04-24  7:30 gqjiang
  2015-04-24  7:30 ` [PATCH 01/10] Add nodes option while creating md gqjiang
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

Basic background for Cluster MD: Cluster MD is a shared-device RAID for a
cluster, currently, the implementation is limited to RAID1 but with further
work (and some positive feedback), it could be extend to other RAID levels.

The kernel part code of cluster-md has been sent to maillist several month
ago by Goldywyn, and to make cluster-md works, the mdadm tools also need to
do some changes accordingly.

This patch set extends mdadm tool to aware cluster MD scenario, and handle
related cluster-md scenario.

1. the first part (0001-0007) comes from Goldwyn, which add initial
support for cluster-md, those changes included make mdadm awares nodes,
home-cluster and n bitmaps for clustered mode, also let mdadm can 
confirm disk which is added by another node.


2. the second part is for support change cluster-name and node nums under
assemble mode. Which extend write-bitmap to handle above cases, and also
use the extended write_bitmap for update uuid. [PATCH 10/10] is just compiled
test only.

BTW: this series is based on commit "72a457 IMSM: Count arrays per orom".

Some reltated links:
[1] http://marc.info/?l=linux-raid&m=141891941330336&w=2
[2] http://marc.info/?l=linux-raid&m=141935561418770&w=2

Guoqing Jiang (10):
  Add nodes option while creating md
  home-cluster while creating an array
  Create n bitmaps for clustered mode
  Show all bitmaps while examining bitmap
  Add a new clustered disk
  Convert a bitmap=none device to clustered
  Skip clustered devices in incremental
  mdadm: add the ability to change cluster name
  mdadm: change the num of cluster node
  Reuse the write_bitmap for update uuid

 Assemble.c    |  14 +++++--
 Create.c      |   5 ++-
 Grow.c        |  22 ++++++++---
 Incremental.c |   5 +++
 Makefile      |   1 +
 Manage.c      |  33 +++++++++++++++--
 ReadMe.c      |   3 ++
 bitmap.c      |  55 +++++++++++++--------------
 bitmap.h      |   7 +++-
 config.c      |  27 +++++++++++++-
 md_p.h        |   7 ++++
 md_u.h        |   1 +
 mdadm.8.in    |  27 +++++++++++++-
 mdadm.c       |  70 ++++++++++++++++++++++++++++++++++-
 mdadm.h       |  20 +++++++++-
 super0.c      |   4 +-
 super1.c      | 117 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 util.c        |  61 ++++++++++++++++++++++++++++++
 18 files changed, 408 insertions(+), 71 deletions(-)

-- 
1.7.12.4


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 01/10] Add nodes option while creating md
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:30   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 02/10] home-cluster while creating an array gqjiang
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

Specifies the maximum number of nodes in the cluster that may use
this device simultaneously. This is equivalent to the number of
bitmaps created in the internal superblock (patches to follow).

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Create.c   |  1 +
 ReadMe.c   |  1 +
 mdadm.8.in |  5 +++++
 mdadm.c    | 20 +++++++++++++++++++-
 mdadm.h    |  3 +++
 5 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Create.c b/Create.c
index ef28da0..b73f6cb 100644
--- a/Create.c
+++ b/Create.c
@@ -531,6 +531,7 @@ int Create(struct supertype *st, char *mddev,
 				st->ss->name);
 		warn = 1;
 	}
+	st->nodes = c->nodes;
 
 	if (warn) {
 		if (c->runstop!= 1) {
diff --git a/ReadMe.c b/ReadMe.c
index 87a4916..30c569d 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -140,6 +140,7 @@ struct option long_options[] = {
     {"homehost",  1, 0,  HomeHost},
     {"symlinks",  1, 0,  Symlinks},
     {"data-offset",1, 0, DataOffset},
+    {"nodes",1, 0, Nodes},
 
     /* For assemble */
     {"uuid",      1, 0, 'u'},
diff --git a/mdadm.8.in b/mdadm.8.in
index a630310..bd8d59e 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -966,6 +966,11 @@ However for RAID0, it is not possible to add spares.  So to increase
 the number of devices in a RAID0, it is necessary to set the new
 number of devices, and to add the new devices, in the same command.
 
+.TP
+.BR \-\-nodes
+Specify the maximum number of nodes in the cluster that will use this
+device simultaneously. If not specified, this defaults to 4.
+
 .SH For assemble:
 
 .TP
diff --git a/mdadm.c b/mdadm.c
index 3e8c49b..bce6a76 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -588,7 +588,14 @@ int main(int argc, char *argv[])
 			}
 			ident.raid_disks = s.raiddisks;
 			continue;
-
+		case O(CREATE, Nodes):
+			c.nodes = parse_num(optarg);
+			if (c.nodes <= 0) {
+				pr_err("invalid number for the number of "
+						"cluster nodes: %s\n", optarg);
+				exit(2);
+			}
+			continue;
 		case O(CREATE,'x'): /* number of spare (eXtra) disks */
 			if (s.sparedisks) {
 				pr_err("spare-devices set twice: %d and %s\n",
@@ -1377,6 +1384,17 @@ int main(int argc, char *argv[])
 	case CREATE:
 		if (c.delay == 0)
 			c.delay = DEFAULT_BITMAP_DELAY;
+
+		if (!strncmp(s.bitmap_file, "internal", 9) ||
+			!strncmp(s.bitmap_file,"none", 4)) {
+			if (c.nodes) {
+				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
+					s.bitmap_file);
+				rv = 1;
+				break;
+			}
+		}
+
 		if (s.write_behind && !s.bitmap_file) {
 			pr_err("write-behind mode requires a bitmap.\n");
 			rv = 1;
diff --git a/mdadm.h b/mdadm.h
index 141f963..9d55801 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -344,6 +344,7 @@ enum special_options {
 	Dump,
 	Restore,
 	Action,
+	Nodes,
 };
 
 enum prefix_standard {
@@ -418,6 +419,7 @@ struct context {
 	char	*backup_file;
 	int	invalid_backup;
 	char	*action;
+	int	nodes;
 };
 
 struct shape {
@@ -1029,6 +1031,7 @@ struct supertype {
 			 */
 	int devcnt;
 	int retry_soon;
+	int nodes;
 
 	struct mdinfo *devs;
 
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 02/10] home-cluster while creating an array
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
  2015-04-24  7:30 ` [PATCH 01/10] Add nodes option while creating md gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-24  7:30 ` [PATCH 03/10] Create n bitmaps for clustered mode gqjiang
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

The home-cluster is stored in the bitmap super block of the
array. The device can be assembled on a cluster with the
cluster name same as the one recorded in the bitmap.

If home-cluster is not specified, this is auto-detected using
dlopen corosync cmap library.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Create.c   |  1 +
 Makefile   |  1 +
 ReadMe.c   |  1 +
 config.c   | 27 ++++++++++++++++++++++++++-
 mdadm.8.in |  6 ++++++
 mdadm.c    | 25 +++++++++++++++++++++++++
 mdadm.h    |  5 +++++
 util.c     | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/Create.c b/Create.c
index b73f6cb..cd5485b 100644
--- a/Create.c
+++ b/Create.c
@@ -532,6 +532,7 @@ int Create(struct supertype *st, char *mddev,
 		warn = 1;
 	}
 	st->nodes = c->nodes;
+	st->cluster_name = c->homecluster;
 
 	if (warn) {
 		if (c->runstop!= 1) {
diff --git a/Makefile b/Makefile
index a7d8c5c..431f08b 100644
--- a/Makefile
+++ b/Makefile
@@ -101,6 +101,7 @@ endif
 # If you want a static binary, you might uncomment these
 # LDFLAGS = -static
 # STRIP = -s
+LDLIBS=-ldl
 
 INSTALL = /usr/bin/install
 DESTDIR =
diff --git a/ReadMe.c b/ReadMe.c
index 30c569d..c6286ae 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -141,6 +141,7 @@ struct option long_options[] = {
     {"symlinks",  1, 0,  Symlinks},
     {"data-offset",1, 0, DataOffset},
     {"nodes",1, 0, Nodes},
+    {"home-cluster",1, 0, ClusterName},
 
     /* For assemble */
     {"uuid",      1, 0, 'u'},
diff --git a/config.c b/config.c
index 7342c42..21b6afd 100644
--- a/config.c
+++ b/config.c
@@ -77,7 +77,7 @@ char DefaultAltConfFile[] = CONFFILE2;
 char DefaultAltConfDir[] = CONFFILE2 ".d";
 
 enum linetype { Devices, Array, Mailaddr, Mailfrom, Program, CreateDev,
-		Homehost, AutoMode, Policy, PartPolicy, LTEnd };
+		Homehost, HomeCluster, AutoMode, Policy, PartPolicy, LTEnd };
 char *keywords[] = {
 	[Devices]  = "devices",
 	[Array]    = "array",
@@ -86,6 +86,7 @@ char *keywords[] = {
 	[Program]  = "program",
 	[CreateDev]= "create",
 	[Homehost] = "homehost",
+	[HomeCluster] = "homecluster",
 	[AutoMode] = "auto",
 	[Policy]   = "policy",
 	[PartPolicy]="part-policy",
@@ -562,6 +563,21 @@ void homehostline(char *line)
 	}
 }
 
+static char *home_cluster = NULL;
+void homeclusterline(char *line)
+{
+	char *w;
+
+	for (w=dl_next(line); w != line ; w=dl_next(w)) {
+		if (home_cluster == NULL) {
+			if (strcasecmp(w, "<none>")==0)
+				home_cluster = xstrdup("");
+			else
+				home_cluster = xstrdup(w);
+		}
+	}
+}
+
 char auto_yes[] = "yes";
 char auto_no[] = "no";
 char auto_homehost[] = "homehost";
@@ -724,6 +740,9 @@ void conf_file(FILE *f)
 		case Homehost:
 			homehostline(line);
 			break;
+		case HomeCluster:
+			homeclusterline(line);
+			break;
 		case AutoMode:
 			autoline(line);
 			break;
@@ -884,6 +903,12 @@ char *conf_get_homehost(int *require_homehostp)
 	return home_host;
 }
 
+char *conf_get_homecluster(void)
+{
+	load_conffile();
+	return home_cluster;
+}
+
 struct createinfo *conf_get_create_info(void)
 {
 	load_conffile();
diff --git a/mdadm.8.in b/mdadm.8.in
index bd8d59e..a0e8288 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -415,6 +415,12 @@ This functionality is currently only provided by
 and
 .BR \-\-monitor .
 
+.TP
+.B \-\-home\-cluster=
+specifies the cluster name for the md device. The md device can be assembled
+only on the cluster which matches the name specified. If this option is not
+provided, mdadm tried to detect the cluster name automatically.
+
 .SH For create, build, or grow:
 
 .TP
diff --git a/mdadm.c b/mdadm.c
index bce6a76..e4f8568 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -596,6 +596,13 @@ int main(int argc, char *argv[])
 				exit(2);
 			}
 			continue;
+		case O(CREATE, ClusterName):
+			c.homecluster = optarg;
+			if (strlen(c.homecluster) > 64) {
+				pr_err("Cluster name too big.\n");
+				exit(ERANGE);
+			}
+			continue;
 		case O(CREATE,'x'): /* number of spare (eXtra) disks */
 			if (s.sparedisks) {
 				pr_err("spare-devices set twice: %d and %s\n",
@@ -1267,6 +1274,18 @@ int main(int argc, char *argv[])
 		c.require_homehost = 0;
 	}
 
+	if (c.homecluster == NULL && (c.nodes > 0)) {
+		c.homecluster = conf_get_homecluster();
+		if (c.homecluster == NULL)
+			rv = get_cluster_name(&c.homecluster);
+		if (rv == 0) {
+			c.homehost = xstrdup(c.homecluster);
+			/* Add a : to differentiate between a host
+			 * and a cluster */
+			strcat(c.homehost, ":");
+		}
+	}
+
 	if (c.backup_file && data_offset != INVALID_SECTORS) {
 		pr_err("--backup-file and --data-offset are incompatible\n");
 		exit(2);
@@ -1393,6 +1412,12 @@ int main(int argc, char *argv[])
 				rv = 1;
 				break;
 			}
+			if (c.homecluster) {
+				pr_err("--home-cluster argument is incompatible with --bitmap=%s.\n",
+					s.bitmap_file);
+				rv = 1;
+				break;
+			}
 		}
 
 		if (s.write_behind && !s.bitmap_file) {
diff --git a/mdadm.h b/mdadm.h
index 9d55801..f56d9d6 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -345,6 +345,7 @@ enum special_options {
 	Restore,
 	Action,
 	Nodes,
+	ClusterName,
 };
 
 enum prefix_standard {
@@ -420,6 +421,7 @@ struct context {
 	int	invalid_backup;
 	char	*action;
 	int	nodes;
+	char	*homecluster;
 };
 
 struct shape {
@@ -1032,6 +1034,7 @@ struct supertype {
 	int devcnt;
 	int retry_soon;
 	int nodes;
+	char *cluster_name;
 
 	struct mdinfo *devs;
 
@@ -1308,6 +1311,7 @@ extern char *conf_get_mailaddr(void);
 extern char *conf_get_mailfrom(void);
 extern char *conf_get_program(void);
 extern char *conf_get_homehost(int *require_homehostp);
+extern char *conf_get_homecluster(void);
 extern char *conf_line(FILE *file);
 extern char *conf_word(FILE *file, int allow_key);
 extern void print_quoted(char *str);
@@ -1416,6 +1420,7 @@ extern char *stat2devnm(struct stat *st);
 extern char *fd2devnm(int fd);
 
 extern int in_initrd(void);
+extern int get_cluster_name(char **name);
 
 #define _ROUND_UP(val, base)	(((val) + (base) - 1) & ~(base - 1))
 #define ROUND_UP(val, base)	_ROUND_UP(val, (typeof(val))(base))
diff --git a/util.c b/util.c
index cc98d3b..ed9a745 100644
--- a/util.c
+++ b/util.c
@@ -34,6 +34,8 @@
 #include	<ctype.h>
 #include	<dirent.h>
 #include	<signal.h>
+#include	<dlfcn.h>
+#include	<corosync/cmap.h>
 
 /*
  * following taken from linux/blkpg.h because they aren't
@@ -1976,3 +1978,51 @@ void reopen_mddev(int mdfd)
 	if (fd >= 0 && fd != mdfd)
 		dup2(fd, mdfd);
 }
+
+int get_cluster_name(char **cluster_name)
+{
+        void *lib_handle = NULL;
+        int rv = -1;
+
+        cmap_handle_t handle;
+        static int (*initialize)(cmap_handle_t *handle);
+        static int (*get_string)(cmap_handle_t handle,
+                        const char *string,
+                        char **name);
+        static int (*finalize)(cmap_handle_t handle);
+
+
+        lib_handle = dlopen("libcmap.so.4", RTLD_NOW | RTLD_LOCAL);
+        if (!lib_handle)
+                return rv;
+
+        initialize = dlsym(lib_handle, "cmap_initialize");
+        if (!initialize)
+                goto out;
+
+        get_string = dlsym(lib_handle, "cmap_get_string");
+        if (!get_string)
+                goto out;
+
+        finalize = dlsym(lib_handle, "cmap_finalize");
+        if (!finalize)
+                goto out;
+
+        rv = initialize(&handle);
+        if (rv != CS_OK)
+                goto out;
+
+        rv = get_string(handle, "totem.cluster_name", cluster_name);
+        if (rv != CS_OK) {
+                free(*cluster_name);
+                rv = -1;
+                goto name_err;
+        }
+
+        rv = 0;
+name_err:
+        finalize(handle);
+out:
+        dlclose(lib_handle);
+        return rv;
+}
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
  2015-04-24  7:30 ` [PATCH 01/10] Add nodes option while creating md gqjiang
  2015-04-24  7:30 ` [PATCH 02/10] home-cluster while creating an array gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:36   ` NeilBrown
  2015-04-29  1:41   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 04/10] Show all bitmaps while examining bitmap gqjiang
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

For a clustered MD, create bitmaps equal to number of nodes so
each node has an independent bitmap.

Only the first bitmap is has the bits set so that the first node
that assembles the device also performs the sync.

The bitmaps are aligned to 4k boundaries.

On-disk format:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Create.c   |  3 ++-
 bitmap.h   |  7 +++++--
 mdadm.8.in |  7 ++++++-
 mdadm.c    | 17 ++++++++++++++++-
 super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
 5 files changed, 70 insertions(+), 23 deletions(-)

diff --git a/Create.c b/Create.c
index cd5485b..9663dc4 100644
--- a/Create.c
+++ b/Create.c
@@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
 #endif
 	}
 
-	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
+	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
+			|| strcmp(s->bitmap_file, "clustered")==0)) {
 		if ((vers%100) < 2) {
 			pr_err("internal bitmaps not supported by this kernel.\n");
 			goto abort_locked;
diff --git a/bitmap.h b/bitmap.h
index c8725a3..adbf0b4 100644
--- a/bitmap.h
+++ b/bitmap.h
@@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
 	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
 	__u32 daemon_sleep; /* 56  seconds between disk flushes */
 	__u32 write_behind; /* 60  number of outstanding write-behind writes */
-
-	__u8  pad[256 - 64]; /* set to zero */
+	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
+				 * reserved for the bitmap. */
+	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
+	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
+	__u8  pad[256 - 136]; /* set to zero */
 } bitmap_super_t;
 
 /* notes:
diff --git a/mdadm.8.in b/mdadm.8.in
index a0e8288..c015cbf 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -700,7 +700,12 @@ and so is replicated on all devices.  If the word
 .B "none"
 is given with
 .B \-\-grow
-mode, then any bitmap that is present is removed.
+mode, then any bitmap that is present is removed. If the word
+.B "clustered"
+is given, the array is created for a clustered environment. One bitmap
+is created for each node as defined by the
+.B \-\-nodes
+parameter and are stored internally.
 
 To help catch typing errors, the filename must contain at least one
 slash ('/') if it is a real file (not 'internal' or 'none').
diff --git a/mdadm.c b/mdadm.c
index e4f8568..6963a09 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1111,6 +1111,15 @@ int main(int argc, char *argv[])
 				s.bitmap_file = optarg;
 				continue;
 			}
+			if (strcmp(optarg, "clustered")== 0) {
+				s.bitmap_file = optarg;
+				/* Set the default number of cluster nodes
+				 * to 4 if not already set by user
+				 */
+				if (c.nodes < 1)
+					c.nodes = 4;
+				continue;
+			}
 			/* probable typo */
 			pr_err("bitmap file must contain a '/', or be 'internal', or 'none'\n"
 				"       not '%s'\n", optarg);
@@ -1404,7 +1413,13 @@ int main(int argc, char *argv[])
 		if (c.delay == 0)
 			c.delay = DEFAULT_BITMAP_DELAY;
 
-		if (!strncmp(s.bitmap_file, "internal", 9) ||
+		if (!strncmp(s.bitmap_file, "clustered", 9)) {
+			if (s.level != 1) {
+				pr_err("--bitmap=clustered is currently supported with RAID mirror only\n");
+				rv = 1;
+				break;
+			}
+		} else if (!strncmp(s.bitmap_file, "internal", 9) ||
 			!strncmp(s.bitmap_file,"none", 4)) {
 			if (c.nodes) {
 				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
diff --git a/super1.c b/super1.c
index f0508fe..ac1b011 100644
--- a/super1.c
+++ b/super1.c
@@ -2144,6 +2144,10 @@ add_internal_bitmap1(struct supertype *st,
 	bms->daemon_sleep = __cpu_to_le32(delay);
 	bms->sync_size = __cpu_to_le64(size);
 	bms->write_behind = __cpu_to_le32(write_behind);
+	bms->nodes = __cpu_to_le32(st->nodes);
+	if (st->cluster_name)
+		strncpy((char *)bms->cluster_name,
+				st->cluster_name, strlen(st->cluster_name));
 
 	*chunkp = chunk;
 	return 1;
@@ -2177,6 +2181,7 @@ static int write_bitmap1(struct supertype *st, int fd)
 	void *buf;
 	int towrite, n;
 	struct align_fd afd;
+	unsigned int i;
 
 	init_afd(&afd, fd);
 
@@ -2185,27 +2190,45 @@ static int write_bitmap1(struct supertype *st, int fd)
 	if (posix_memalign(&buf, 4096, 4096))
 		return -ENOMEM;
 
-	memset(buf, 0xff, 4096);
-	memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
-
-	towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
-	towrite = (towrite+7) >> 3; /* bits to bytes */
-	towrite += sizeof(bitmap_super_t);
-	towrite = ROUND_UP(towrite, 512);
-	while (towrite > 0) {
-		n = towrite;
-		if (n > 4096)
-			n = 4096;
-		n = awrite(&afd, buf, n);
-		if (n > 0)
-			towrite -= n;
+	/* We use bms->nodes as opposed to st->nodes to
+	 * be compatible with write-after-reads such as
+	 * the GROW operation.
+	 */
+	for (i = 0; i < __le32_to_cpu(bms->nodes); i++) {
+		/* Only the first bitmap should resync
+		 * the whole device
+		 */
+		if (i)
+			memset(buf, 0x00, 4096);
 		else
+			memset(buf, 0xff, 4096);
+		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
+
+		towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
+		towrite = (towrite+7) >> 3; /* bits to bytes */
+		towrite += sizeof(bitmap_super_t);
+		/* we need the bitmaps to be at 4k boundary */
+		towrite = ROUND_UP(towrite, 4096);
+		while (towrite > 0) {
+			n = towrite;
+			if (n > 4096)
+				n = 4096;
+			n = awrite(&afd, buf, n);
+			if (n > 0)
+				towrite -= n;
+			else
+				break;
+			if (i)
+				memset(buf, 0x00, 4096);
+			else
+				memset(buf, 0xff, 4096);
+		}
+		fsync(fd);
+		if (towrite) {
+			rv = -2;
 			break;
-		memset(buf, 0xff, 4096);
+		}
 	}
-	fsync(fd);
-	if (towrite)
-		rv = -2;
 
 	free(buf);
 	return rv;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 04/10] Show all bitmaps while examining bitmap
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (2 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 03/10] Create n bitmaps for clustered mode gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:41   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 05/10] Add a new clustered disk gqjiang
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

This adds capability of exmining bitmaps corresponding to all
nodes/slots on the device.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 bitmap.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/bitmap.c b/bitmap.c
index b1d54a6..ab83f4e 100644
--- a/bitmap.c
+++ b/bitmap.c
@@ -258,7 +258,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
 	int rv = 1;
 	char buf[64];
 	int swap;
-	int fd;
+	int fd, i;
 	__u32 uuid32[4];
 
 	fd = bitmap_file_open(filename, &st);
@@ -315,9 +315,6 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
 		       uuid32[2],
 		       uuid32[3]);
 
-	printf("          Events : %llu\n", (unsigned long long)sb->events);
-	printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
-	printf("           State : %s\n", bitmap_state(sb->state));
 	printf("       Chunksize : %s\n", human_chunksize(sb->chunksize));
 	printf("          Daemon : %ds flush period\n", sb->daemon_sleep);
 	if (sb->write_behind)
@@ -327,11 +324,31 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
 	printf("      Write Mode : %s\n", buf);
 	printf("       Sync Size : %llu%s\n", (unsigned long long)sb->sync_size/2,
 					human_size(sb->sync_size * 512));
-	if (brief)
-		goto free_info;
-	printf("          Bitmap : %llu bits (chunks), %llu dirty (%2.1f%%)\n",
-			info->total_bits, info->dirty_bits,
-			100.0 * info->dirty_bits / (info->total_bits?:1));
+	if (sb->nodes > 0) {
+		printf("   Cluster nodes : %d\n", sb->nodes);
+		printf("    Cluster name : %s\n", sb->cluster_name);
+	}
+	i = 0;
+	do {
+		if (i) {
+			free(info);
+			info = bitmap_fd_read(fd, brief);
+			sb = &info->sb;
+		}
+		if (sb->magic != BITMAP_MAGIC) {
+			pr_err("invalid bitmap magic 0x%x, the bitmap file appears to be corrupted\n", sb->magic);
+		}
+		printf("       Node Slot : %d\n", i);
+		printf("          Events : %llu\n", (unsigned long long)sb->events);
+		printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
+		printf("           State : %s\n", bitmap_state(sb->state));
+		if (brief)
+			continue;
+		printf("          Bitmap : %llu bits (chunks), %llu dirty (%2.1f%%)\n",
+				info->total_bits, info->dirty_bits,
+				100.0 * info->dirty_bits / (info->total_bits?:1));
+	} while (++i < (int)sb->nodes);
+
 free_info:
 	free(info);
 	return rv;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 05/10] Add a new clustered disk
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (3 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 04/10] Show all bitmaps while examining bitmap gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:45   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 06/10] Convert a bitmap=none device to clustered gqjiang
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

A clustered disk is added by the traditional --add sequence.
However, other nodes need to acknowledge that they can "see"
the device. This is done by --cluster-confirm:

--cluster-confirm Y:/dev/whatever (if disk is found)
or
--cluster-confirm Y:missing (if disk is not found)

The node initiating the --add, has the disk state tagged with
MD_DISK_CLUSTER_ADD and the one confirming tag the disk with
MD_DISK_CANDIDATE.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Manage.c   | 33 +++++++++++++++++++++++++++++----
 ReadMe.c   |  1 +
 md_p.h     |  7 +++++++
 md_u.h     |  1 +
 mdadm.8.in |  9 +++++++++
 mdadm.c    |  4 ++++
 mdadm.h    |  2 ++
 util.c     | 11 +++++++++++
 8 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/Manage.c b/Manage.c
index d3cfb55..4c3d451 100644
--- a/Manage.c
+++ b/Manage.c
@@ -690,7 +690,8 @@ skip_re_add:
 int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	       struct supertype *tst, mdu_array_info_t *array,
 	       int force, int verbose, char *devname,
-	       char *update, unsigned long rdev, unsigned long long array_size)
+	       char *update, unsigned long rdev, unsigned long long array_size,
+	       int raid_slot)
 {
 	unsigned long long ldsize;
 	struct supertype *dev_st = NULL;
@@ -879,7 +880,10 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 	}
 	disc.major = major(rdev);
 	disc.minor = minor(rdev);
-	disc.number =j;
+	if (raid_slot < 0)
+		disc.number = j;
+	else
+		disc.number = raid_slot;
 	disc.state = 0;
 	if (array->not_persistent==0) {
 		int dfd;
@@ -920,6 +924,14 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
 			}
 		free(used);
 	}
+
+	if (array->state & (1 << MD_SB_CLUSTERED)) {
+		if (dv->disposition == 'c')
+			disc.state |= (1 << MD_DISK_CANDIDATE);
+		else
+			disc.state |= (1 << MD_DISK_CLUSTER_ADD);
+	}
+
 	if (dv->writemostly == 1)
 		disc.state |= (1 << MD_DISK_WRITEMOSTLY);
 	if (tst->ss->external) {
@@ -1239,6 +1251,7 @@ int Manage_subdevs(char *devname, int fd,
 	 *        variant on 'A'
 	 *  'F' - Another variant of 'A', where the device was faulty
 	 *        so must be removed from the array first.
+	 *  'c' - confirm the device as found (for clustered environments)
 	 *
 	 * For 'f' and 'r', the device can also be a kernel-internal
 	 * name such as 'sdb'.
@@ -1254,6 +1267,7 @@ int Manage_subdevs(char *devname, int fd,
 	struct mdinfo info;
 	int frozen = 0;
 	int busy = 0;
+	int raid_slot = -1;
 
 	if (ioctl(fd, GET_ARRAY_INFO, &array)) {
 		pr_err("Cannot get array info for %s\n",
@@ -1282,6 +1296,11 @@ int Manage_subdevs(char *devname, int fd,
 		int rv;
 		int mj,mn;
 
+		raid_slot = -1;
+		if (dv->disposition == 'c')
+			parse_cluster_confirm_arg(dv->devname, &dv->devname,
+					&raid_slot);
+
 		if (strcmp(dv->devname, "failed") == 0 ||
 		    strcmp(dv->devname, "faulty") == 0) {
 			if (dv->disposition != 'A'
@@ -1307,6 +1326,11 @@ int Manage_subdevs(char *devname, int fd,
 		if (strcmp(dv->devname, "missing") == 0) {
 			struct mddev_dev *add_devlist = NULL;
 			struct mddev_dev **dp;
+			if (dv->disposition == 'c') {
+				rv = ioctl(fd, CLUSTERED_DISK_NACK, NULL);
+				break;
+			}
+
 			if (dv->disposition != 'A') {
 				pr_err("'missing' only meaningful with --re-add\n");
 				goto abort;
@@ -1399,7 +1423,7 @@ int Manage_subdevs(char *devname, int fd,
 			else {
 				int open_err = errno;
 				if (stat(dv->devname, &stb) != 0) {
-					pr_err("Cannot find %s: %s\n",
+					pr_err("%s: %d Cannot find %s: %s\n", __func__, __LINE__,
 					       dv->devname, strerror(errno));
 					goto abort;
 				}
@@ -1437,6 +1461,7 @@ int Manage_subdevs(char *devname, int fd,
 		case 'A':
 		case 'M': /* --re-add missing */
 		case 'F': /* --re-add faulty  */
+		case 'c': /* --cluster-confirm */
 			/* add the device */
 			if (subarray) {
 				pr_err("Cannot add disks to a \'member\' array, perform this operation on the parent container\n");
@@ -1470,7 +1495,7 @@ int Manage_subdevs(char *devname, int fd,
 			}
 			rv = Manage_add(fd, tfd, dv, tst, &array,
 					force, verbose, devname, update,
-					rdev, array_size);
+					rdev, array_size, raid_slot);
 			close(tfd);
 			tfd = -1;
 			if (rv < 0)
diff --git a/ReadMe.c b/ReadMe.c
index c6286ae..c854cd5 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -169,6 +169,7 @@ struct option long_options[] = {
     {"wait",	  0, 0,  WaitOpt},
     {"wait-clean", 0, 0, Waitclean },
     {"action",    1, 0, Action },
+    {"cluster-confirm", 0, 0, ClusterConfirm},
 
     /* For Detail/Examine */
     {"brief",	  0, 0, Brief},
diff --git a/md_p.h b/md_p.h
index c4846ba..e59504f 100644
--- a/md_p.h
+++ b/md_p.h
@@ -78,6 +78,12 @@
 #define MD_DISK_ACTIVE		1 /* disk is running but may not be in sync */
 #define MD_DISK_SYNC		2 /* disk is in sync with the raid set */
 #define MD_DISK_REMOVED		3 /* disk is in sync with the raid set */
+#define MD_DISK_CLUSTER_ADD     4 /* Initiate a disk add across the cluster
+				   * For clustered enviroments only.
+				   */
+#define MD_DISK_CANDIDATE	5 /* disk is added as spare (local) until confirmed
+				   * For clustered enviroments only.
+				   */
 
 #define	MD_DISK_WRITEMOSTLY	9 /* disk is "write-mostly" is RAID1 config.
 				   * read requests will only be sent here in
@@ -106,6 +112,7 @@ typedef struct mdp_device_descriptor_s {
 #define MD_SB_BLOCK_CONTAINER_RESHAPE 3 /* block container wide reshapes */
 #define MD_SB_BLOCK_VOLUME	4 /* block activation of array, other arrays
 				   * in container can be activated */
+#define MD_SB_CLUSTERED		5 /* MD is clustered  */
 #define	MD_SB_BITMAP_PRESENT	8 /* bitmap may be present nearby */
 
 typedef struct mdp_superblock_s {
diff --git a/md_u.h b/md_u.h
index be9868a..76068d6 100644
--- a/md_u.h
+++ b/md_u.h
@@ -44,6 +44,7 @@
 #define STOP_ARRAY		_IO (MD_MAJOR, 0x32)
 #define STOP_ARRAY_RO		_IO (MD_MAJOR, 0x33)
 #define RESTART_ARRAY_RW	_IO (MD_MAJOR, 0x34)
+#define CLUSTERED_DISK_NACK	_IO (MD_MAJOR, 0x35)
 
 typedef struct mdu_version_s {
 	int major;
diff --git a/mdadm.8.in b/mdadm.8.in
index c015cbf..6873cc7 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -1405,6 +1405,15 @@ will avoid reading from these devices if possible.
 .BR \-\-readwrite
 Subsequent devices that are added or re\-added will have the 'write-mostly'
 flag cleared.
+.TP
+.BR \-\-cluster\-confirm
+Confirm the existence of the device. This is issued in response to an \-\-add
+request by a node in a cluster. When a node adds a device it sends a message
+to all nodes in the cluster to look for a device with a UUID. This translates
+to a udev notification with the UUID of the device to be added and the slot
+number. The receiving node must acknowledge this message
+with \-\-cluster\-confirm. Valid arguments are <slot>:<devicename> in case
+the device is found or <slot>:missing in case the device is not found.
 
 .P
 Each of these options requires that the first device listed is the array
diff --git a/mdadm.c b/mdadm.c
index 6963a09..5b4b3ef 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -196,6 +196,7 @@ int main(int argc, char *argv[])
 		case 'f':
 		case Fail:
 		case ReAdd: /* re-add */
+		case ClusterConfirm:
 			if (!mode) {
 				newmode = MANAGE;
 				shortopt = short_bitmap_options;
@@ -933,6 +934,9 @@ int main(int argc, char *argv[])
 					   * remove the device */
 			devmode = 'f';
 			continue;
+		case O(MANAGE, ClusterConfirm):
+			devmode = 'c';
+			continue;
 		case O(MANAGE,Replace):
 			/* Mark these devices for replacement */
 			devmode = 'R';
diff --git a/mdadm.h b/mdadm.h
index f56d9d6..00c726e 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -346,6 +346,7 @@ enum special_options {
 	Action,
 	Nodes,
 	ClusterName,
+	ClusterConfirm,
 };
 
 enum prefix_standard {
@@ -1281,6 +1282,7 @@ extern int parse_uuid(char *str, int uuid[4]);
 extern int parse_layout_10(char *layout);
 extern int parse_layout_faulty(char *layout);
 extern long parse_num(char *num);
+extern int parse_cluster_confirm_arg(char *inp, char **devname, int *slot);
 extern int check_ext2(int fd, char *name);
 extern int check_reiser(int fd, char *name);
 extern int check_raid(int fd, char *name);
diff --git a/util.c b/util.c
index ed9a745..1d82fc7 100644
--- a/util.c
+++ b/util.c
@@ -273,6 +273,17 @@ long parse_num(char *num)
 }
 #endif
 
+int parse_cluster_confirm_arg(char *input, char **devname, int *slot)
+{
+	char *dev;
+	*slot = strtoul(input, &dev, 10);
+	if (dev[0] == ':')
+		*devname = dev+1;
+	else
+		return -1;
+	return 0;
+}
+
 void remove_partitions(int fd)
 {
 	/* remove partitions from this block devices.
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 06/10] Convert a bitmap=none device to clustered
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (4 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 05/10] Add a new clustered disk gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-24  7:30 ` [PATCH 07/10] Skip clustered devices in incremental gqjiang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

This adds the ability to convert a regular md without bitmap
(--bitmap=none) to a clustered device (--bitmap=clustered).

To convert a device with --bitmap=internal or --bitmap=external,
you have to convert to --bitmap=none and then re-execute the
command with --bitmap=clustered.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Grow.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/Grow.c b/Grow.c
index 9a573fd..1122cec 100644
--- a/Grow.c
+++ b/Grow.c
@@ -330,9 +330,16 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			}
 			return 0;
 		}
-		pr_err("Internal bitmap already present on %s\n",
-			devname);
-		return 1;
+		if ((strcmp(s->bitmap_file, "clustered")==0) && (array.state & (1<<MD_SB_CLUSTERED))) {
+			pr_err("Clustered bitmap already present on %s\n",
+					devname);
+			return 1;
+		}
+		if ((strcmp(s->bitmap_file, "internal")==0) && (!(array.state & (1<<MD_SB_CLUSTERED)))) {
+			pr_err("Internal bitmap already present on %s\n",
+					devname);
+			return 1;
+		}
 	}
 
 	if (strcmp(s->bitmap_file, "none") == 0) {
@@ -375,7 +382,8 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 		free(st);
 		return 1;
 	}
-	if (strcmp(s->bitmap_file, "internal") == 0) {
+	if ((strcmp(s->bitmap_file, "internal") == 0) ||
+		(strcmp(s->bitmap_file, "clustered") == 0)) {
 		int rv;
 		int d;
 		int offset_setable = 0;
@@ -384,6 +392,8 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			pr_err("Internal bitmaps not supported with %s metadata\n", st->ss->name);
 			return 1;
 		}
+		st->nodes = c->nodes;
+		st->cluster_name = c->homecluster;
 		mdi = sysfs_read(fd, NULL, GET_BITMAP_LOCATION);
 		if (mdi)
 			offset_setable = 1;
@@ -426,6 +436,8 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 			rv = sysfs_set_num_signed(mdi, NULL, "bitmap/location",
 						  mdi->bitmap_offset);
 		} else {
+			if (strcmp(s->bitmap_file, "clustered") == 0)
+				array.state |= (1<<MD_SB_CLUSTERED);
 			array.state |= (1<<MD_SB_BITMAP_PRESENT);
 			rv = ioctl(fd, SET_ARRAY_INFO, &array);
 		}
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 07/10] Skip clustered devices in incremental
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (5 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 06/10] Convert a bitmap=none device to clustered gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-24  7:30 ` [PATCH 08/10] mdadm: add the ability to change cluster name gqjiang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

We want the clustered devices to be started exclusively by a cluster
resource-agent. So, avoid starting using the incremental option.

This also skips a clustered md from starting during boot in inactive mode.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Incremental.c | 5 +++++
 super1.c      | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/Incremental.c b/Incremental.c
index 0c9a9a4..5450a5c 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -232,6 +232,11 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 				devname);
 		goto out;
 	}
+	/* Skip the clustered ones. This should be started by
+	 * clustering resource agents
+	 */
+	if (info.array.state & (1 << MD_SB_CLUSTERED))
+		goto out;
 
 	/* 3a/ if not, check for homehost match.  If no match, continue
 	 * but don't trust the 'name' in the array. Thus a 'random' minor
diff --git a/super1.c b/super1.c
index ac1b011..3c2fce2 100644
--- a/super1.c
+++ b/super1.c
@@ -891,6 +891,8 @@ static void getinfo_super1(struct supertype *st, struct mdinfo *info, char *map)
 	info->array.state =
 		(__le64_to_cpu(sb->resync_offset) == MaxSector)
 		? 1 : 0;
+	if (__le32_to_cpu(bsb->nodes) > 1)
+		info->array.state |= (1 << MD_SB_CLUSTERED);
 
 	info->data_offset = __le64_to_cpu(sb->data_offset);
 	info->component_size = __le64_to_cpu(sb->size);
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 08/10] mdadm: add the ability to change cluster name
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (6 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 07/10] Skip clustered devices in incremental gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:50   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 09/10] mdadm: change the num of cluster node gqjiang
  2015-04-24  7:30 ` [PATCH 10/10] Reuse the write_bitmap for update uuid gqjiang
  9 siblings, 1 reply; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

To support change the cluster name, the commit do the followings:

1. extend original write_bitmap function for new scenario.
2. add the scenarion to handle the modification of cluster's name
   in write_bitmap1.
3. make update_super1 can change the name in mdp_superblock_1.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Assemble.c |  5 +++++
 Grow.c     |  2 +-
 mdadm.c    |  3 +++
 mdadm.h    |  7 ++++++-
 super0.c   |  4 ++--
 super1.c   | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 6 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 25a103d..e1b846c 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -644,6 +644,11 @@ static int load_devices(struct devs *devices, char *devmap,
 				*stp = st;
 				return -1;
 			}
+			if (strcmp(c->update, "home-cluster") == 0) {
+				err = tst->ss->update_super(tst, content, c->update,
+							    devname, 0, 0, c->homecluster);
+				tst->ss->write_bitmap(tst, dfd, NameUpdate);
+			}
 			if (strcmp(c->update, "uuid")==0 &&
 			    !ident->uuid_set) {
 				ident->uuid_set = 1;
diff --git a/Grow.c b/Grow.c
index 1122cec..bf44e66 100644
--- a/Grow.c
+++ b/Grow.c
@@ -420,7 +420,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
 						    bitmapsize, offset_setable,
 						    major)
 						)
-						st->ss->write_bitmap(st, fd2);
+						st->ss->write_bitmap(st, fd2, NoUpdate);
 					else {
 						pr_err("failed to create internal bitmap - chunksize problem.\n");
 						close(fd2);
diff --git a/mdadm.c b/mdadm.c
index 5b4b3ef..20f195d 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -598,6 +598,7 @@ int main(int argc, char *argv[])
 			}
 			continue;
 		case O(CREATE, ClusterName):
+		case O(ASSEMBLE, ClusterName):
 			c.homecluster = optarg;
 			if (strlen(c.homecluster) > 64) {
 				pr_err("Cluster name too big.\n");
@@ -741,6 +742,8 @@ int main(int argc, char *argv[])
 				continue;
 			if (strcmp(c.update, "homehost")==0)
 				continue;
+			if (strcmp(c.update, "home-cluster")==0)
+				continue;
 			if (strcmp(c.update, "devicesize")==0)
 				continue;
 			if (strcmp(c.update, "no-bitmap")==0)
diff --git a/mdadm.h b/mdadm.h
index 00c726e..d8b0749 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -354,6 +354,11 @@ enum prefix_standard {
 	IEC
 };
 
+enum bitmap_update {
+    NoUpdate,
+    NameUpdate,
+};
+
 /* structures read from config file */
 /* List of mddevice names and identifiers
  * Identifiers can be:
@@ -850,7 +855,7 @@ extern struct superswitch {
 	/* if add_internal_bitmap succeeded for existing array, this
 	 * writes it out.
 	 */
-	int (*write_bitmap)(struct supertype *st, int fd);
+	int (*write_bitmap)(struct supertype *st, int fd, enum bitmap_update update);
 	/* Free the superblock and any other allocated data */
 	void (*free_super)(struct supertype *st);
 
diff --git a/super0.c b/super0.c
index deb5999..6ad9d39 100644
--- a/super0.c
+++ b/super0.c
@@ -900,7 +900,7 @@ static int write_init_super0(struct supertype *st)
 		rv = store_super0(st, di->fd);
 
 		if (rv == 0 && (sb->state & (1<<MD_SB_BITMAP_PRESENT)))
-			rv = st->ss->write_bitmap(st, di->fd);
+			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
 
 		if (rv)
 			pr_err("failed to write superblock to %s\n",
@@ -1175,7 +1175,7 @@ static void locate_bitmap0(struct supertype *st, int fd)
 	lseek64(fd, offset, 0);
 }
 
-static int write_bitmap0(struct supertype *st, int fd)
+static int write_bitmap0(struct supertype *st, int fd, enum bitmap_update update)
 {
 	unsigned long long dsize;
 	unsigned long long offset;
diff --git a/super1.c b/super1.c
index 3c2fce2..e43bef1 100644
--- a/super1.c
+++ b/super1.c
@@ -1073,6 +1073,25 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		info->name[32] = 0;
 	}
 
+	if (strcmp(update, "home-cluster") == 0 &&
+	    homehost) {
+		/* Note that 'home-cluster' is to change the name of cluster,
+		 * it is another "name" update.
+		 */
+		char *new_name = xmalloc(sizeof(sb->set_name));
+		if (strrchr(sb->set_name, ':')) {
+			strcpy(new_name, strchr(sb->set_name, ':'));
+		}
+
+		memset(sb->set_name, 0, sizeof(sb->set_name));
+		strcpy(sb->set_name, homehost);
+		if (new_name)
+			strcat(sb->set_name, new_name);
+
+		free(new_name);
+		goto out;
+	}
+
 	if (strcmp(update, "force-one")==0) {
 		/* Not enough devices for a working array,
 		 * so bring this one up-to-date
@@ -1313,6 +1332,7 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 	else
 		rv = -1;
 
+out:
 	sb->sb_csum = calc_sb_1_csum(sb);
 	return rv;
 }
@@ -1691,7 +1711,7 @@ static int write_init_super1(struct supertype *st)
 		sb->sb_csum = calc_sb_1_csum(sb);
 		rv = store_super1(st, di->fd);
 		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
-			rv = st->ss->write_bitmap(st, di->fd);
+			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
 		close(di->fd);
 		di->fd = -1;
 		if (rv)
@@ -2175,7 +2195,7 @@ static void locate_bitmap1(struct supertype *st, int fd)
 	lseek64(fd, offset<<9, 0);
 }
 
-static int write_bitmap1(struct supertype *st, int fd)
+static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update)
 {
 	struct mdp_superblock_1 *sb = st->sb;
 	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb)+MAX_SB_SIZE);
@@ -2185,6 +2205,28 @@ static int write_bitmap1(struct supertype *st, int fd)
 	struct align_fd afd;
 	unsigned int i;
 
+	switch (update) {
+	case NameUpdate:
+	{
+	    char *new_name = xmalloc(sizeof(sb->set_name));
+
+	    strncpy(new_name, sb->set_name, sizeof(sb->set_name));
+	    memset((char *)bms->cluster_name, 0, sizeof(bms->cluster_name));
+
+	    if (strtok(new_name, ":"))
+		strncpy((char *)bms->cluster_name, new_name, strlen(sb->set_name));
+	    else
+		/* In case the original set_name doesn't like aaa:md* */
+		strncpy((char *)bms->cluster_name, sb->set_name, strlen(sb->set_name));
+
+	    free(new_name);
+	    break;
+	}
+	case NoUpdate:
+	default:
+	    break;
+	}
+
 	init_afd(&afd, fd);
 
 	locate_bitmap1(st, fd);
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 09/10] mdadm: change the num of cluster node
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (7 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 08/10] mdadm: add the ability to change cluster name gqjiang
@ 2015-04-24  7:30 ` gqjiang
  2015-04-29  1:51   ` NeilBrown
  2015-04-24  7:30 ` [PATCH 10/10] Reuse the write_bitmap for update uuid gqjiang
  9 siblings, 1 reply; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

This extends nodes option for assemble mode, make the num of
cluster node could be change by user.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Assemble.c | 4 ++++
 ReadMe.c   | 2 +-
 mdadm.c    | 3 +++
 mdadm.h    | 1 +
 super1.c   | 6 ++++++
 5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/Assemble.c b/Assemble.c
index e1b846c..22042a9 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -649,6 +649,10 @@ static int load_devices(struct devs *devices, char *devmap,
 							    devname, 0, 0, c->homecluster);
 				tst->ss->write_bitmap(tst, dfd, NameUpdate);
 			}
+			if (strcmp(c->update, "nodes") == 0) {
+				tst->nodes = c->nodes;
+				tst->ss->write_bitmap(tst, dfd, NodeNumUpdate);
+			}
 			if (strcmp(c->update, "uuid")==0 &&
 			    !ident->uuid_set) {
 				ident->uuid_set = 1;
diff --git a/ReadMe.c b/ReadMe.c
index c854cd5..d1830e1 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -140,7 +140,7 @@ struct option long_options[] = {
     {"homehost",  1, 0,  HomeHost},
     {"symlinks",  1, 0,  Symlinks},
     {"data-offset",1, 0, DataOffset},
-    {"nodes",1, 0, Nodes},
+    {"nodes",1, 0, Nodes}, /* also for --assemble */
     {"home-cluster",1, 0, ClusterName},
 
     /* For assemble */
diff --git a/mdadm.c b/mdadm.c
index 20f195d..344bde2 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -589,6 +589,7 @@ int main(int argc, char *argv[])
 			}
 			ident.raid_disks = s.raiddisks;
 			continue;
+		case O(ASSEMBLE, Nodes):
 		case O(CREATE, Nodes):
 			c.nodes = parse_num(optarg);
 			if (c.nodes <= 0) {
@@ -744,6 +745,8 @@ int main(int argc, char *argv[])
 				continue;
 			if (strcmp(c.update, "home-cluster")==0)
 				continue;
+			if (strcmp(c.update, "nodes")==0)
+				continue;
 			if (strcmp(c.update, "devicesize")==0)
 				continue;
 			if (strcmp(c.update, "no-bitmap")==0)
diff --git a/mdadm.h b/mdadm.h
index d8b0749..97892e6 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -357,6 +357,7 @@ enum prefix_standard {
 enum bitmap_update {
     NoUpdate,
     NameUpdate,
+    NodeNumUpdate,
 };
 
 /* structures read from config file */
diff --git a/super1.c b/super1.c
index e43bef1..047e799 100644
--- a/super1.c
+++ b/super1.c
@@ -1329,6 +1329,9 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
 		sb->devflags |= WriteMostly1;
 	else if (strcmp(update, "readwrite")==0)
 		sb->devflags &= ~WriteMostly1;
+	else if (strcmp(update, "nodes")==0)
+		/* Just a placeholder since no related member in mdp_superblock_1 */
+		;
 	else
 		rv = -1;
 
@@ -2222,6 +2225,9 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
 	    free(new_name);
 	    break;
 	}
+	case NodeNumUpdate:
+	    bms->nodes = __cpu_to_le32(st->nodes);
+	    break;
 	case NoUpdate:
 	default:
 	    break;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 10/10] Reuse the write_bitmap for update uuid
  2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
                   ` (8 preceding siblings ...)
  2015-04-24  7:30 ` [PATCH 09/10] mdadm: change the num of cluster node gqjiang
@ 2015-04-24  7:30 ` gqjiang
  9 siblings, 0 replies; 31+ messages in thread
From: gqjiang @ 2015-04-24  7:30 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, rgoldwyn

From: Guoqing Jiang <gqjiang@suse.com>

Since write_bitmap is extended for handle different sistuations,
then it also could possible to support change the uuid of bitmap,
and remove bitmap_update_uuid accordingly.

Q: is the write_bitmap0 also impacted?

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Assemble.c |  5 ++---
 bitmap.c   | 20 --------------------
 mdadm.h    |  2 +-
 super1.c   |  4 ++++
 4 files changed, 7 insertions(+), 24 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 22042a9..314e4b4 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -664,9 +664,8 @@ static int load_devices(struct devs *devices, char *devmap,
 
 			if (strcmp(c->update, "uuid")==0 &&
 			    ident->bitmap_fd >= 0 && !bitmap_done) {
-				if (bitmap_update_uuid(ident->bitmap_fd,
-						       content->uuid,
-						       tst->ss->swapuuid) != 0)
+				copy_uuid(tst->devs->uuid, content->uuid, tst->ss->swapuuid);
+				if (tst->ss->write_bitmap(tst, dfd, UUIDUpdate))
 					pr_err("Could not update uuid on external bitmap.\n");
 				else
 					bitmap_done = 1;
diff --git a/bitmap.c b/bitmap.c
index ab83f4e..cd65050 100644
--- a/bitmap.c
+++ b/bitmap.c
@@ -441,23 +441,3 @@ out:
 		unlink(filename); /* possibly corrupted, better get rid of it */
 	return rv;
 }
-
-int bitmap_update_uuid(int fd, int *uuid, int swap)
-{
-	struct bitmap_super_s bm;
-	if (lseek(fd, 0, 0) != 0)
-		return 1;
-	if (read(fd, &bm, sizeof(bm)) != sizeof(bm))
-		return 1;
-	if (bm.magic != __cpu_to_le32(BITMAP_MAGIC))
-		return 1;
-	copy_uuid(bm.uuid, uuid, swap);
-	if (lseek(fd, 0, 0) != 0)
-		return 2;
-	if (write(fd, &bm, sizeof(bm)) != sizeof(bm)) {
-		lseek(fd, 0, 0);
-		return 2;
-	}
-	lseek(fd, 0, 0);
-	return 0;
-}
diff --git a/mdadm.h b/mdadm.h
index 97892e6..7b9bb28 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -358,6 +358,7 @@ enum bitmap_update {
     NoUpdate,
     NameUpdate,
     NodeNumUpdate,
+    UUIDUpdate,
 };
 
 /* structures read from config file */
@@ -1273,7 +1274,6 @@ extern int CreateBitmap(char *filename, int force, char uuid[16],
 			int major);
 extern int ExamineBitmap(char *filename, int brief, struct supertype *st);
 extern int Write_rules(char *rule_name);
-extern int bitmap_update_uuid(int fd, int *uuid, int swap);
 extern unsigned long bitmap_sectors(struct bitmap_super_s *bsb);
 extern int Dump_metadata(char *dev, char *dir, struct context *c,
 			 struct supertype *st);
diff --git a/super1.c b/super1.c
index 047e799..b2fc6d7 100644
--- a/super1.c
+++ b/super1.c
@@ -2228,6 +2228,10 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
 	case NodeNumUpdate:
 	    bms->nodes = __cpu_to_le32(st->nodes);
 	    break;
+	case UUIDUpdate:
+	    memset((char *)bms->uuid, 0, sizeof(bms->uuid));
+	    strncpy((char *)bms->uuid, (char *)st->devs->uuid, sizeof(bms->uuid));
+	    break;
 	case NoUpdate:
 	default:
 	    break;
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/10] Add nodes option while creating md
  2015-04-24  7:30 ` [PATCH 01/10] Add nodes option while creating md gqjiang
@ 2015-04-29  1:30   ` NeilBrown
  2015-04-30  2:33     ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:30 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 4442 bytes --]

On Fri, 24 Apr 2015 15:30:32 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> Specifies the maximum number of nodes in the cluster that may use
> this device simultaneously. This is equivalent to the number of
> bitmaps created in the internal superblock (patches to follow).
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

This doesn't really make much sense coming first in the series.  It sets up a
value that is never used.

I would much rather 03/10 came first - which would just create 4 bitmaps.
Then have this patch to allow that '4' to be changed.


> ---
>  Create.c   |  1 +
>  ReadMe.c   |  1 +
>  mdadm.8.in |  5 +++++
>  mdadm.c    | 20 +++++++++++++++++++-
>  mdadm.h    |  3 +++
>  5 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/Create.c b/Create.c
> index ef28da0..b73f6cb 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -531,6 +531,7 @@ int Create(struct supertype *st, char *mddev,
>  				st->ss->name);
>  		warn = 1;
>  	}
> +	st->nodes = c->nodes;
>  
>  	if (warn) {
>  		if (c->runstop!= 1) {
> diff --git a/ReadMe.c b/ReadMe.c
> index 87a4916..30c569d 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -140,6 +140,7 @@ struct option long_options[] = {
>      {"homehost",  1, 0,  HomeHost},
>      {"symlinks",  1, 0,  Symlinks},
>      {"data-offset",1, 0, DataOffset},
> +    {"nodes",1, 0, Nodes},
>  
>      /* For assemble */
>      {"uuid",      1, 0, 'u'},
> diff --git a/mdadm.8.in b/mdadm.8.in
> index a630310..bd8d59e 100644
> --- a/mdadm.8.in
> +++ b/mdadm.8.in
> @@ -966,6 +966,11 @@ However for RAID0, it is not possible to add spares.  So to increase
>  the number of devices in a RAID0, it is necessary to set the new
>  number of devices, and to add the new devices, in the same command.
>  
> +.TP
> +.BR \-\-nodes
> +Specify the maximum number of nodes in the cluster that will use this
> +device simultaneously. If not specified, this defaults to 4.
> +

I think this should mention that it only makes sense with --bitmap=cluster
(which of course isn't available at this point....)


>  .SH For assemble:
>  
>  .TP
> diff --git a/mdadm.c b/mdadm.c
> index 3e8c49b..bce6a76 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -588,7 +588,14 @@ int main(int argc, char *argv[])
>  			}
>  			ident.raid_disks = s.raiddisks;
>  			continue;
> -
> +		case O(CREATE, Nodes):
> +			c.nodes = parse_num(optarg);
> +			if (c.nodes <= 0) {
> +				pr_err("invalid number for the number of "
> +						"cluster nodes: %s\n", optarg);
> +				exit(2);
> +			}
> +			continue;
>  		case O(CREATE,'x'): /* number of spare (eXtra) disks */
>  			if (s.sparedisks) {
>  				pr_err("spare-devices set twice: %d and %s\n",
> @@ -1377,6 +1384,17 @@ int main(int argc, char *argv[])
>  	case CREATE:
>  		if (c.delay == 0)
>  			c.delay = DEFAULT_BITMAP_DELAY;
> +
> +		if (!strncmp(s.bitmap_file, "internal", 9) ||
> +			!strncmp(s.bitmap_file,"none", 4)) {

I'm sorry but I absolutely *hate* this construct.
The '!' at the front makes it seem like you are testing that the string does
*not* have that value, but it is exactly the reverse.
And why "strncmp" rather than "strcmp" ??

Please use

    if (strcmp(s.bitmap_file, "internal") == 0 ||
        strcmp(s.bitmap_file, "none) == 0) {

except.... what about if bitmap_file in /path/to/somewhere.  Presumably you
want to exclude that case too.
May be
   if (strcmp(s.bitmap_file, "cluster") != 0) {
??

Thanks,
NeilBrown


> +			if (c.nodes) {
> +				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
> +					s.bitmap_file);
> +				rv = 1;
> +				break;
> +			}
> +		}
> +
>  		if (s.write_behind && !s.bitmap_file) {
>  			pr_err("write-behind mode requires a bitmap.\n");
>  			rv = 1;
> diff --git a/mdadm.h b/mdadm.h
> index 141f963..9d55801 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -344,6 +344,7 @@ enum special_options {
>  	Dump,
>  	Restore,
>  	Action,
> +	Nodes,
>  };
>  
>  enum prefix_standard {
> @@ -418,6 +419,7 @@ struct context {
>  	char	*backup_file;
>  	int	invalid_backup;
>  	char	*action;
> +	int	nodes;
>  };
>  
>  struct shape {
> @@ -1029,6 +1031,7 @@ struct supertype {
>  			 */
>  	int devcnt;
>  	int retry_soon;
> +	int nodes;
>  
>  	struct mdinfo *devs;
>  


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-24  7:30 ` [PATCH 03/10] Create n bitmaps for clustered mode gqjiang
@ 2015-04-29  1:36   ` NeilBrown
  2015-04-29  2:41     ` Goldwyn Rodrigues
  2015-04-29  1:41   ` NeilBrown
  1 sibling, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:36 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 7113 bytes --]

On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> For a clustered MD, create bitmaps equal to number of nodes so
> each node has an independent bitmap.
> 
> Only the first bitmap is has the bits set so that the first node
> that assembles the device also performs the sync.
> 
> The bitmaps are aligned to 4k boundaries.
> 
> On-disk format:
> 
> 0                    4k                     8k                    12k
> -------------------------------------------------------------------
> | idle                | md super            | bm super [0] + bits |
> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
> | bm bits [3, contd]  |                     |                     |
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Create.c   |  3 ++-
>  bitmap.h   |  7 +++++--
>  mdadm.8.in |  7 ++++++-
>  mdadm.c    | 17 ++++++++++++++++-
>  super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
>  5 files changed, 70 insertions(+), 23 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index cd5485b..9663dc4 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
>  #endif
>  	}
>  
> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
>  		if ((vers%100) < 2) {
>  			pr_err("internal bitmaps not supported by this kernel.\n");
>  			goto abort_locked;
> diff --git a/bitmap.h b/bitmap.h
> index c8725a3..adbf0b4 100644
> --- a/bitmap.h
> +++ b/bitmap.h
> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
>  	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
>  	__u32 daemon_sleep; /* 56  seconds between disk flushes */
>  	__u32 write_behind; /* 60  number of outstanding write-behind writes */
> -
> -	__u8  pad[256 - 64]; /* set to zero */
> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
> +				 * reserved for the bitmap. */
> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> +	__u8  pad[256 - 136]; /* set to zero */
>  } bitmap_super_t;
>  
>  /* notes:
> diff --git a/mdadm.8.in b/mdadm.8.in
> index a0e8288..c015cbf 100644
> --- a/mdadm.8.in
> +++ b/mdadm.8.in
> @@ -700,7 +700,12 @@ and so is replicated on all devices.  If the word
>  .B "none"
>  is given with
>  .B \-\-grow
> -mode, then any bitmap that is present is removed.
> +mode, then any bitmap that is present is removed. If the word
> +.B "clustered"
> +is given, the array is created for a clustered environment. One bitmap
> +is created for each node as defined by the
> +.B \-\-nodes
> +parameter and are stored internally.
>  
>  To help catch typing errors, the filename must contain at least one
>  slash ('/') if it is a real file (not 'internal' or 'none').
> diff --git a/mdadm.c b/mdadm.c
> index e4f8568..6963a09 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1111,6 +1111,15 @@ int main(int argc, char *argv[])
>  				s.bitmap_file = optarg;
>  				continue;
>  			}
> +			if (strcmp(optarg, "clustered")== 0) {
> +				s.bitmap_file = optarg;
> +				/* Set the default number of cluster nodes
> +				 * to 4 if not already set by user
> +				 */
> +				if (c.nodes < 1)
> +					c.nodes = 4;
> +				continue;
> +			}
>  			/* probable typo */
>  			pr_err("bitmap file must contain a '/', or be 'internal', or 'none'\n"
>  				"       not '%s'\n", optarg);
> @@ -1404,7 +1413,13 @@ int main(int argc, char *argv[])
>  		if (c.delay == 0)
>  			c.delay = DEFAULT_BITMAP_DELAY;
>  
> -		if (!strncmp(s.bitmap_file, "internal", 9) ||
> +		if (!strncmp(s.bitmap_file, "clustered", 9)) {
> +			if (s.level != 1) {
> +				pr_err("--bitmap=clustered is currently supported with RAID mirror only\n");
> +				rv = 1;
> +				break;
> +			}
> +		} else if (!strncmp(s.bitmap_file, "internal", 9) ||
>  			!strncmp(s.bitmap_file,"none", 4)) {
>  			if (c.nodes) {
>  				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
> diff --git a/super1.c b/super1.c
> index f0508fe..ac1b011 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -2144,6 +2144,10 @@ add_internal_bitmap1(struct supertype *st,
>  	bms->daemon_sleep = __cpu_to_le32(delay);
>  	bms->sync_size = __cpu_to_le64(size);
>  	bms->write_behind = __cpu_to_le32(write_behind);
> +	bms->nodes = __cpu_to_le32(st->nodes);
> +	if (st->cluster_name)
> +		strncpy((char *)bms->cluster_name,
> +				st->cluster_name, strlen(st->cluster_name));
>  
>  	*chunkp = chunk;
>  	return 1;
> @@ -2177,6 +2181,7 @@ static int write_bitmap1(struct supertype *st, int fd)
>  	void *buf;
>  	int towrite, n;
>  	struct align_fd afd;
> +	unsigned int i;
>  
>  	init_afd(&afd, fd);
>  
> @@ -2185,27 +2190,45 @@ static int write_bitmap1(struct supertype *st, int fd)
>  	if (posix_memalign(&buf, 4096, 4096))
>  		return -ENOMEM;
>  
> -	memset(buf, 0xff, 4096);
> -	memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
> -
> -	towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
> -	towrite = (towrite+7) >> 3; /* bits to bytes */
> -	towrite += sizeof(bitmap_super_t);
> -	towrite = ROUND_UP(towrite, 512);
> -	while (towrite > 0) {
> -		n = towrite;
> -		if (n > 4096)
> -			n = 4096;
> -		n = awrite(&afd, buf, n);
> -		if (n > 0)
> -			towrite -= n;
> +	/* We use bms->nodes as opposed to st->nodes to
> +	 * be compatible with write-after-reads such as
> +	 * the GROW operation.
> +	 */
> +	for (i = 0; i < __le32_to_cpu(bms->nodes); i++) {
> +		/* Only the first bitmap should resync
> +		 * the whole device
> +		 */
> +		if (i)
> +			memset(buf, 0x00, 4096);
>  		else
> +			memset(buf, 0xff, 4096);

Why is the first bitmap initialised to 0x00 and the others to 0xff?
If there is a good reason it should be documented either in a comment in the
code or in the changelog entry.

Thanks,
NeilBrown


> +		memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
> +
> +		towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
> +		towrite = (towrite+7) >> 3; /* bits to bytes */
> +		towrite += sizeof(bitmap_super_t);
> +		/* we need the bitmaps to be at 4k boundary */
> +		towrite = ROUND_UP(towrite, 4096);
> +		while (towrite > 0) {
> +			n = towrite;
> +			if (n > 4096)
> +				n = 4096;
> +			n = awrite(&afd, buf, n);
> +			if (n > 0)
> +				towrite -= n;
> +			else
> +				break;
> +			if (i)
> +				memset(buf, 0x00, 4096);
> +			else
> +				memset(buf, 0xff, 4096);
> +		}
> +		fsync(fd);
> +		if (towrite) {
> +			rv = -2;
>  			break;
> -		memset(buf, 0xff, 4096);
> +		}
>  	}
> -	fsync(fd);
> -	if (towrite)
> -		rv = -2;
>  
>  	free(buf);
>  	return rv;


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-24  7:30 ` [PATCH 03/10] Create n bitmaps for clustered mode gqjiang
  2015-04-29  1:36   ` NeilBrown
@ 2015-04-29  1:41   ` NeilBrown
  2015-04-30  2:44     ` Guoqing Jiang
  1 sibling, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:41 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 2586 bytes --]

On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> For a clustered MD, create bitmaps equal to number of nodes so
> each node has an independent bitmap.
> 
> Only the first bitmap is has the bits set so that the first node
> that assembles the device also performs the sync.
> 
> The bitmaps are aligned to 4k boundaries.
> 
> On-disk format:
> 
> 0                    4k                     8k                    12k
> -------------------------------------------------------------------
> | idle                | md super            | bm super [0] + bits |
> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
> | bm bits [3, contd]  |                     |                     |
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Create.c   |  3 ++-
>  bitmap.h   |  7 +++++--
>  mdadm.8.in |  7 ++++++-
>  mdadm.c    | 17 ++++++++++++++++-
>  super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
>  5 files changed, 70 insertions(+), 23 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index cd5485b..9663dc4 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
>  #endif
>  	}
>  
> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
>  		if ((vers%100) < 2) {
>  			pr_err("internal bitmaps not supported by this kernel.\n");
>  			goto abort_locked;
> diff --git a/bitmap.h b/bitmap.h
> index c8725a3..adbf0b4 100644
> --- a/bitmap.h
> +++ b/bitmap.h
> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
>  	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
>  	__u32 daemon_sleep; /* 56  seconds between disk flushes */
>  	__u32 write_behind; /* 60  number of outstanding write-behind writes */
> -
> -	__u8  pad[256 - 64]; /* set to zero */
> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
> +				 * reserved for the bitmap. */
> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> +	__u8  pad[256 - 136]; /* set to zero */
>  } bitmap_super_t;

I missed this the first time, but these fields that you have added need to be
added to sb_le_to_cpu().

NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/10] Show all bitmaps while examining bitmap
  2015-04-24  7:30 ` [PATCH 04/10] Show all bitmaps while examining bitmap gqjiang
@ 2015-04-29  1:41   ` NeilBrown
  2015-04-30  3:17     ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:41 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 2941 bytes --]

On Fri, 24 Apr 2015 15:30:35 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> This adds capability of exmining bitmaps corresponding to all
> nodes/slots on the device.
> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  bitmap.c | 35 ++++++++++++++++++++++++++---------
>  1 file changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/bitmap.c b/bitmap.c
> index b1d54a6..ab83f4e 100644
> --- a/bitmap.c
> +++ b/bitmap.c
> @@ -258,7 +258,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
>  	int rv = 1;
>  	char buf[64];
>  	int swap;
> -	int fd;
> +	int fd, i;
>  	__u32 uuid32[4];
>  
>  	fd = bitmap_file_open(filename, &st);
> @@ -315,9 +315,6 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
>  		       uuid32[2],
>  		       uuid32[3]);
>  
> -	printf("          Events : %llu\n", (unsigned long long)sb->events);
> -	printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
> -	printf("           State : %s\n", bitmap_state(sb->state));

Can we please leave this where it is in the case where there is only one node?
Only move it down if there are multiple nodes to report on.

NeilBrown


>  	printf("       Chunksize : %s\n", human_chunksize(sb->chunksize));
>  	printf("          Daemon : %ds flush period\n", sb->daemon_sleep);
>  	if (sb->write_behind)
> @@ -327,11 +324,31 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
>  	printf("      Write Mode : %s\n", buf);
>  	printf("       Sync Size : %llu%s\n", (unsigned long long)sb->sync_size/2,
>  					human_size(sb->sync_size * 512));
> -	if (brief)
> -		goto free_info;
> -	printf("          Bitmap : %llu bits (chunks), %llu dirty (%2.1f%%)\n",
> -			info->total_bits, info->dirty_bits,
> -			100.0 * info->dirty_bits / (info->total_bits?:1));
> +	if (sb->nodes > 0) {
> +		printf("   Cluster nodes : %d\n", sb->nodes);
> +		printf("    Cluster name : %s\n", sb->cluster_name);
> +	}
> +	i = 0;
> +	do {
> +		if (i) {
> +			free(info);
> +			info = bitmap_fd_read(fd, brief);
> +			sb = &info->sb;
> +		}
> +		if (sb->magic != BITMAP_MAGIC) {
> +			pr_err("invalid bitmap magic 0x%x, the bitmap file appears to be corrupted\n", sb->magic);
> +		}
> +		printf("       Node Slot : %d\n", i);
> +		printf("          Events : %llu\n", (unsigned long long)sb->events);
> +		printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
> +		printf("           State : %s\n", bitmap_state(sb->state));
> +		if (brief)
> +			continue;
> +		printf("          Bitmap : %llu bits (chunks), %llu dirty (%2.1f%%)\n",
> +				info->total_bits, info->dirty_bits,
> +				100.0 * info->dirty_bits / (info->total_bits?:1));
> +	} while (++i < (int)sb->nodes);
> +
>  free_info:
>  	free(info);
>  	return rv;


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] Add a new clustered disk
  2015-04-24  7:30 ` [PATCH 05/10] Add a new clustered disk gqjiang
@ 2015-04-29  1:45   ` NeilBrown
  2015-04-30  3:20     ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:45 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 9758 bytes --]

On Fri, 24 Apr 2015 15:30:36 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> A clustered disk is added by the traditional --add sequence.
> However, other nodes need to acknowledge that they can "see"
> the device. This is done by --cluster-confirm:
> 
> --cluster-confirm Y:/dev/whatever (if disk is found)
> or
> --cluster-confirm Y:missing (if disk is not found)
> 
> The node initiating the --add, has the disk state tagged with
> MD_DISK_CLUSTER_ADD and the one confirming tag the disk with
> MD_DISK_CANDIDATE.

You haven't explained 'Y' here.  It looks like it means 'Yes', but it doesn't.


> 
> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Manage.c   | 33 +++++++++++++++++++++++++++++----
>  ReadMe.c   |  1 +
>  md_p.h     |  7 +++++++
>  md_u.h     |  1 +
>  mdadm.8.in |  9 +++++++++
>  mdadm.c    |  4 ++++
>  mdadm.h    |  2 ++
>  util.c     | 11 +++++++++++
>  8 files changed, 64 insertions(+), 4 deletions(-)
> 
> diff --git a/Manage.c b/Manage.c
> index d3cfb55..4c3d451 100644
> --- a/Manage.c
> +++ b/Manage.c
> @@ -690,7 +690,8 @@ skip_re_add:
>  int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>  	       struct supertype *tst, mdu_array_info_t *array,
>  	       int force, int verbose, char *devname,
> -	       char *update, unsigned long rdev, unsigned long long array_size)
> +	       char *update, unsigned long rdev, unsigned long long array_size,
> +	       int raid_slot)
>  {
>  	unsigned long long ldsize;
>  	struct supertype *dev_st = NULL;
> @@ -879,7 +880,10 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>  	}
>  	disc.major = major(rdev);
>  	disc.minor = minor(rdev);
> -	disc.number =j;
> +	if (raid_slot < 0)
> +		disc.number = j;
> +	else
> +		disc.number = raid_slot;
>  	disc.state = 0;
>  	if (array->not_persistent==0) {
>  		int dfd;
> @@ -920,6 +924,14 @@ int Manage_add(int fd, int tfd, struct mddev_dev *dv,
>  			}
>  		free(used);
>  	}
> +
> +	if (array->state & (1 << MD_SB_CLUSTERED)) {
> +		if (dv->disposition == 'c')
> +			disc.state |= (1 << MD_DISK_CANDIDATE);
> +		else
> +			disc.state |= (1 << MD_DISK_CLUSTER_ADD);
> +	}
> +
>  	if (dv->writemostly == 1)
>  		disc.state |= (1 << MD_DISK_WRITEMOSTLY);
>  	if (tst->ss->external) {
> @@ -1239,6 +1251,7 @@ int Manage_subdevs(char *devname, int fd,
>  	 *        variant on 'A'
>  	 *  'F' - Another variant of 'A', where the device was faulty
>  	 *        so must be removed from the array first.
> +	 *  'c' - confirm the device as found (for clustered environments)
>  	 *
>  	 * For 'f' and 'r', the device can also be a kernel-internal
>  	 * name such as 'sdb'.
> @@ -1254,6 +1267,7 @@ int Manage_subdevs(char *devname, int fd,
>  	struct mdinfo info;
>  	int frozen = 0;
>  	int busy = 0;
> +	int raid_slot = -1;
>  
>  	if (ioctl(fd, GET_ARRAY_INFO, &array)) {
>  		pr_err("Cannot get array info for %s\n",
> @@ -1282,6 +1296,11 @@ int Manage_subdevs(char *devname, int fd,
>  		int rv;
>  		int mj,mn;
>  
> +		raid_slot = -1;
> +		if (dv->disposition == 'c')
> +			parse_cluster_confirm_arg(dv->devname, &dv->devname,
> +					&raid_slot);
> +
>  		if (strcmp(dv->devname, "failed") == 0 ||
>  		    strcmp(dv->devname, "faulty") == 0) {
>  			if (dv->disposition != 'A'
> @@ -1307,6 +1326,11 @@ int Manage_subdevs(char *devname, int fd,
>  		if (strcmp(dv->devname, "missing") == 0) {
>  			struct mddev_dev *add_devlist = NULL;
>  			struct mddev_dev **dp;
> +			if (dv->disposition == 'c') {
> +				rv = ioctl(fd, CLUSTERED_DISK_NACK, NULL);
> +				break;
> +			}
> +
>  			if (dv->disposition != 'A') {
>  				pr_err("'missing' only meaningful with --re-add\n");
>  				goto abort;
> @@ -1399,7 +1423,7 @@ int Manage_subdevs(char *devname, int fd,
>  			else {
>  				int open_err = errno;
>  				if (stat(dv->devname, &stb) != 0) {
> -					pr_err("Cannot find %s: %s\n",
> +					pr_err("%s: %d Cannot find %s: %s\n", __func__, __LINE__,
>  					       dv->devname, strerror(errno));
>  					goto abort;
>  				}
> @@ -1437,6 +1461,7 @@ int Manage_subdevs(char *devname, int fd,
>  		case 'A':
>  		case 'M': /* --re-add missing */
>  		case 'F': /* --re-add faulty  */
> +		case 'c': /* --cluster-confirm */
>  			/* add the device */
>  			if (subarray) {
>  				pr_err("Cannot add disks to a \'member\' array, perform this operation on the parent container\n");
> @@ -1470,7 +1495,7 @@ int Manage_subdevs(char *devname, int fd,
>  			}
>  			rv = Manage_add(fd, tfd, dv, tst, &array,
>  					force, verbose, devname, update,
> -					rdev, array_size);
> +					rdev, array_size, raid_slot);
>  			close(tfd);
>  			tfd = -1;
>  			if (rv < 0)
> diff --git a/ReadMe.c b/ReadMe.c
> index c6286ae..c854cd5 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -169,6 +169,7 @@ struct option long_options[] = {
>      {"wait",	  0, 0,  WaitOpt},
>      {"wait-clean", 0, 0, Waitclean },
>      {"action",    1, 0, Action },
> +    {"cluster-confirm", 0, 0, ClusterConfirm},
>  
>      /* For Detail/Examine */
>      {"brief",	  0, 0, Brief},
> diff --git a/md_p.h b/md_p.h
> index c4846ba..e59504f 100644
> --- a/md_p.h
> +++ b/md_p.h
> @@ -78,6 +78,12 @@
>  #define MD_DISK_ACTIVE		1 /* disk is running but may not be in sync */
>  #define MD_DISK_SYNC		2 /* disk is in sync with the raid set */
>  #define MD_DISK_REMOVED		3 /* disk is in sync with the raid set */
> +#define MD_DISK_CLUSTER_ADD     4 /* Initiate a disk add across the cluster
> +				   * For clustered enviroments only.
> +				   */
> +#define MD_DISK_CANDIDATE	5 /* disk is added as spare (local) until confirmed
> +				   * For clustered enviroments only.
> +				   */
>  
>  #define	MD_DISK_WRITEMOSTLY	9 /* disk is "write-mostly" is RAID1 config.
>  				   * read requests will only be sent here in
> @@ -106,6 +112,7 @@ typedef struct mdp_device_descriptor_s {
>  #define MD_SB_BLOCK_CONTAINER_RESHAPE 3 /* block container wide reshapes */
>  #define MD_SB_BLOCK_VOLUME	4 /* block activation of array, other arrays
>  				   * in container can be activated */
> +#define MD_SB_CLUSTERED		5 /* MD is clustered  */
>  #define	MD_SB_BITMAP_PRESENT	8 /* bitmap may be present nearby */
>  
>  typedef struct mdp_superblock_s {
> diff --git a/md_u.h b/md_u.h
> index be9868a..76068d6 100644
> --- a/md_u.h
> +++ b/md_u.h
> @@ -44,6 +44,7 @@
>  #define STOP_ARRAY		_IO (MD_MAJOR, 0x32)
>  #define STOP_ARRAY_RO		_IO (MD_MAJOR, 0x33)
>  #define RESTART_ARRAY_RW	_IO (MD_MAJOR, 0x34)
> +#define CLUSTERED_DISK_NACK	_IO (MD_MAJOR, 0x35)
>  
>  typedef struct mdu_version_s {
>  	int major;
> diff --git a/mdadm.8.in b/mdadm.8.in
> index c015cbf..6873cc7 100644
> --- a/mdadm.8.in
> +++ b/mdadm.8.in
> @@ -1405,6 +1405,15 @@ will avoid reading from these devices if possible.
>  .BR \-\-readwrite
>  Subsequent devices that are added or re\-added will have the 'write-mostly'
>  flag cleared.
> +.TP
> +.BR \-\-cluster\-confirm
> +Confirm the existence of the device. This is issued in response to an \-\-add
> +request by a node in a cluster. When a node adds a device it sends a message
> +to all nodes in the cluster to look for a device with a UUID. This translates
> +to a udev notification with the UUID of the device to be added and the slot
> +number. The receiving node must acknowledge this message
> +with \-\-cluster\-confirm. Valid arguments are <slot>:<devicename> in case
> +the device is found or <slot>:missing in case the device is not found.
>  
>  .P
>  Each of these options requires that the first device listed is the array
> diff --git a/mdadm.c b/mdadm.c
> index 6963a09..5b4b3ef 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -196,6 +196,7 @@ int main(int argc, char *argv[])
>  		case 'f':
>  		case Fail:
>  		case ReAdd: /* re-add */
> +		case ClusterConfirm:
>  			if (!mode) {
>  				newmode = MANAGE;
>  				shortopt = short_bitmap_options;
> @@ -933,6 +934,9 @@ int main(int argc, char *argv[])
>  					   * remove the device */
>  			devmode = 'f';
>  			continue;
> +		case O(MANAGE, ClusterConfirm):
> +			devmode = 'c';
> +			continue;
>  		case O(MANAGE,Replace):
>  			/* Mark these devices for replacement */
>  			devmode = 'R';
> diff --git a/mdadm.h b/mdadm.h
> index f56d9d6..00c726e 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -346,6 +346,7 @@ enum special_options {
>  	Action,
>  	Nodes,
>  	ClusterName,
> +	ClusterConfirm,
>  };
>  
>  enum prefix_standard {
> @@ -1281,6 +1282,7 @@ extern int parse_uuid(char *str, int uuid[4]);
>  extern int parse_layout_10(char *layout);
>  extern int parse_layout_faulty(char *layout);
>  extern long parse_num(char *num);
> +extern int parse_cluster_confirm_arg(char *inp, char **devname, int *slot);
>  extern int check_ext2(int fd, char *name);
>  extern int check_reiser(int fd, char *name);
>  extern int check_raid(int fd, char *name);
> diff --git a/util.c b/util.c
> index ed9a745..1d82fc7 100644
> --- a/util.c
> +++ b/util.c
> @@ -273,6 +273,17 @@ long parse_num(char *num)
>  }
>  #endif
>  
> +int parse_cluster_confirm_arg(char *input, char **devname, int *slot)
> +{
> +	char *dev;
> +	*slot = strtoul(input, &dev, 10);
> +	if (dev[0] == ':')
> +		*devname = dev+1;
> +	else
> +		return -1;
> +	return 0;
> +}

The logic here hurts my brain :-(

 *slot = strtoul(input, &dev, 10);
 if (dev == input || dev[0] != ':')
     return -1;
 *devname = dev+1;
 return 0;

> +
>  void remove_partitions(int fd)
>  {
>  	/* remove partitions from this block devices.

Thanks,
NeilBrown


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 08/10] mdadm: add the ability to change cluster name
  2015-04-24  7:30 ` [PATCH 08/10] mdadm: add the ability to change cluster name gqjiang
@ 2015-04-29  1:50   ` NeilBrown
  2015-04-30  3:22     ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:50 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 6994 bytes --]

On Fri, 24 Apr 2015 15:30:39 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> To support change the cluster name, the commit do the followings:
> 
> 1. extend original write_bitmap function for new scenario.
> 2. add the scenarion to handle the modification of cluster's name
>    in write_bitmap1.
> 3. make update_super1 can change the name in mdp_superblock_1.
> 
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Assemble.c |  5 +++++
>  Grow.c     |  2 +-
>  mdadm.c    |  3 +++
>  mdadm.h    |  7 ++++++-
>  super0.c   |  4 ++--
>  super1.c   | 46 ++++++++++++++++++++++++++++++++++++++++++++--
>  6 files changed, 61 insertions(+), 6 deletions(-)
> 
> diff --git a/Assemble.c b/Assemble.c
> index 25a103d..e1b846c 100644
> --- a/Assemble.c
> +++ b/Assemble.c
> @@ -644,6 +644,11 @@ static int load_devices(struct devs *devices, char *devmap,
>  				*stp = st;
>  				return -1;
>  			}
> +			if (strcmp(c->update, "home-cluster") == 0) {
> +				err = tst->ss->update_super(tst, content, c->update,
> +							    devname, 0, 0, c->homecluster);
> +				tst->ss->write_bitmap(tst, dfd, NameUpdate);
> +			}
>  			if (strcmp(c->update, "uuid")==0 &&
>  			    !ident->uuid_set) {
>  				ident->uuid_set = 1;
> diff --git a/Grow.c b/Grow.c
> index 1122cec..bf44e66 100644
> --- a/Grow.c
> +++ b/Grow.c
> @@ -420,7 +420,7 @@ int Grow_addbitmap(char *devname, int fd, struct context *c, struct shape *s)
>  						    bitmapsize, offset_setable,
>  						    major)
>  						)
> -						st->ss->write_bitmap(st, fd2);
> +						st->ss->write_bitmap(st, fd2, NoUpdate);
>  					else {
>  						pr_err("failed to create internal bitmap - chunksize problem.\n");
>  						close(fd2);
> diff --git a/mdadm.c b/mdadm.c
> index 5b4b3ef..20f195d 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -598,6 +598,7 @@ int main(int argc, char *argv[])
>  			}
>  			continue;
>  		case O(CREATE, ClusterName):
> +		case O(ASSEMBLE, ClusterName):
>  			c.homecluster = optarg;
>  			if (strlen(c.homecluster) > 64) {
>  				pr_err("Cluster name too big.\n");
> @@ -741,6 +742,8 @@ int main(int argc, char *argv[])
>  				continue;
>  			if (strcmp(c.update, "homehost")==0)
>  				continue;
> +			if (strcmp(c.update, "home-cluster")==0)
> +				continue;
>  			if (strcmp(c.update, "devicesize")==0)
>  				continue;
>  			if (strcmp(c.update, "no-bitmap")==0)
> diff --git a/mdadm.h b/mdadm.h
> index 00c726e..d8b0749 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -354,6 +354,11 @@ enum prefix_standard {
>  	IEC
>  };
>  
> +enum bitmap_update {
> +    NoUpdate,
> +    NameUpdate,
> +};
> +
>  /* structures read from config file */
>  /* List of mddevice names and identifiers
>   * Identifiers can be:
> @@ -850,7 +855,7 @@ extern struct superswitch {
>  	/* if add_internal_bitmap succeeded for existing array, this
>  	 * writes it out.
>  	 */
> -	int (*write_bitmap)(struct supertype *st, int fd);
> +	int (*write_bitmap)(struct supertype *st, int fd, enum bitmap_update update);
>  	/* Free the superblock and any other allocated data */
>  	void (*free_super)(struct supertype *st);
>  
> diff --git a/super0.c b/super0.c
> index deb5999..6ad9d39 100644
> --- a/super0.c
> +++ b/super0.c
> @@ -900,7 +900,7 @@ static int write_init_super0(struct supertype *st)
>  		rv = store_super0(st, di->fd);
>  
>  		if (rv == 0 && (sb->state & (1<<MD_SB_BITMAP_PRESENT)))
> -			rv = st->ss->write_bitmap(st, di->fd);
> +			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
>  
>  		if (rv)
>  			pr_err("failed to write superblock to %s\n",
> @@ -1175,7 +1175,7 @@ static void locate_bitmap0(struct supertype *st, int fd)
>  	lseek64(fd, offset, 0);
>  }
>  
> -static int write_bitmap0(struct supertype *st, int fd)
> +static int write_bitmap0(struct supertype *st, int fd, enum bitmap_update update)
>  {
>  	unsigned long long dsize;
>  	unsigned long long offset;
> diff --git a/super1.c b/super1.c
> index 3c2fce2..e43bef1 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -1073,6 +1073,25 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>  		info->name[32] = 0;
>  	}
>  
> +	if (strcmp(update, "home-cluster") == 0 &&
> +	    homehost) {
> +		/* Note that 'home-cluster' is to change the name of cluster,
> +		 * it is another "name" update.
> +		 */
> +		char *new_name = xmalloc(sizeof(sb->set_name));
> +		if (strrchr(sb->set_name, ':')) {
> +			strcpy(new_name, strchr(sb->set_name, ':'));
> +		}
> +
> +		memset(sb->set_name, 0, sizeof(sb->set_name));
> +		strcpy(sb->set_name, homehost);
> +		if (new_name)
> +			strcat(sb->set_name, new_name);
> +
> +		free(new_name);
> +		goto out;
> +	}
> +

Please get rid of the 'goto out' and put an 'else' in here.

"homehost" is special because it translates to "name".
"home-cluster" is not special.


>  	if (strcmp(update, "force-one")==0) {
>  		/* Not enough devices for a working array,
>  		 * so bring this one up-to-date
> @@ -1313,6 +1332,7 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>  	else
>  		rv = -1;
>  
> +out:
>  	sb->sb_csum = calc_sb_1_csum(sb);
>  	return rv;
>  }
> @@ -1691,7 +1711,7 @@ static int write_init_super1(struct supertype *st)
>  		sb->sb_csum = calc_sb_1_csum(sb);
>  		rv = store_super1(st, di->fd);
>  		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
> -			rv = st->ss->write_bitmap(st, di->fd);
> +			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
>  		close(di->fd);
>  		di->fd = -1;
>  		if (rv)
> @@ -2175,7 +2195,7 @@ static void locate_bitmap1(struct supertype *st, int fd)
>  	lseek64(fd, offset<<9, 0);
>  }
>  
> -static int write_bitmap1(struct supertype *st, int fd)
> +static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update)
>  {
>  	struct mdp_superblock_1 *sb = st->sb;
>  	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb)+MAX_SB_SIZE);
> @@ -2185,6 +2205,28 @@ static int write_bitmap1(struct supertype *st, int fd)
>  	struct align_fd afd;
>  	unsigned int i;
>  
> +	switch (update) {
> +	case NameUpdate:
> +	{
> +	    char *new_name = xmalloc(sizeof(sb->set_name));
> +
> +	    strncpy(new_name, sb->set_name, sizeof(sb->set_name));
> +	    memset((char *)bms->cluster_name, 0, sizeof(bms->cluster_name));
> +
> +	    if (strtok(new_name, ":"))
> +		strncpy((char *)bms->cluster_name, new_name, strlen(sb->set_name));
> +	    else
> +		/* In case the original set_name doesn't like aaa:md* */
> +		strncpy((char *)bms->cluster_name, sb->set_name, strlen(sb->set_name));
> +
> +	    free(new_name);
> +	    break;
> +	}

I don't like the braces there - too confusing.
Just make 'new_name' a top-level variable and get rid of the {}

Thanks,
NeilBrown


> +	case NoUpdate:
> +	default:
> +	    break;
> +	}
> +
>  	init_afd(&afd, fd);
>  
>  	locate_bitmap1(st, fd);


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 09/10] mdadm: change the num of cluster node
  2015-04-24  7:30 ` [PATCH 09/10] mdadm: change the num of cluster node gqjiang
@ 2015-04-29  1:51   ` NeilBrown
  2015-04-30  3:34     ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-29  1:51 UTC (permalink / raw)
  To: gqjiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 3319 bytes --]

On Fri, 24 Apr 2015 15:30:40 +0800 gqjiang@suse.com wrote:

> From: Guoqing Jiang <gqjiang@suse.com>
> 
> This extends nodes option for assemble mode, make the num of
> cluster node could be change by user.
> 
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Assemble.c | 4 ++++
>  ReadMe.c   | 2 +-
>  mdadm.c    | 3 +++
>  mdadm.h    | 1 +
>  super1.c   | 6 ++++++
>  5 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/Assemble.c b/Assemble.c
> index e1b846c..22042a9 100644
> --- a/Assemble.c
> +++ b/Assemble.c
> @@ -649,6 +649,10 @@ static int load_devices(struct devs *devices, char *devmap,
>  							    devname, 0, 0, c->homecluster);
>  				tst->ss->write_bitmap(tst, dfd, NameUpdate);
>  			}
> +			if (strcmp(c->update, "nodes") == 0) {
> +				tst->nodes = c->nodes;
> +				tst->ss->write_bitmap(tst, dfd, NodeNumUpdate);
> +			}

Doesn't there need to be some test that there is enough free space on all
devices to store the extra bitmaps (when nodes is increasing)??

NeilBrown





>  			if (strcmp(c->update, "uuid")==0 &&
>  			    !ident->uuid_set) {
>  				ident->uuid_set = 1;
> diff --git a/ReadMe.c b/ReadMe.c
> index c854cd5..d1830e1 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -140,7 +140,7 @@ struct option long_options[] = {
>      {"homehost",  1, 0,  HomeHost},
>      {"symlinks",  1, 0,  Symlinks},
>      {"data-offset",1, 0, DataOffset},
> -    {"nodes",1, 0, Nodes},
> +    {"nodes",1, 0, Nodes}, /* also for --assemble */
>      {"home-cluster",1, 0, ClusterName},
>  
>      /* For assemble */
> diff --git a/mdadm.c b/mdadm.c
> index 20f195d..344bde2 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -589,6 +589,7 @@ int main(int argc, char *argv[])
>  			}
>  			ident.raid_disks = s.raiddisks;
>  			continue;
> +		case O(ASSEMBLE, Nodes):
>  		case O(CREATE, Nodes):
>  			c.nodes = parse_num(optarg);
>  			if (c.nodes <= 0) {
> @@ -744,6 +745,8 @@ int main(int argc, char *argv[])
>  				continue;
>  			if (strcmp(c.update, "home-cluster")==0)
>  				continue;
> +			if (strcmp(c.update, "nodes")==0)
> +				continue;
>  			if (strcmp(c.update, "devicesize")==0)
>  				continue;
>  			if (strcmp(c.update, "no-bitmap")==0)
> diff --git a/mdadm.h b/mdadm.h
> index d8b0749..97892e6 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -357,6 +357,7 @@ enum prefix_standard {
>  enum bitmap_update {
>      NoUpdate,
>      NameUpdate,
> +    NodeNumUpdate,
>  };
>  
>  /* structures read from config file */
> diff --git a/super1.c b/super1.c
> index e43bef1..047e799 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -1329,6 +1329,9 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>  		sb->devflags |= WriteMostly1;
>  	else if (strcmp(update, "readwrite")==0)
>  		sb->devflags &= ~WriteMostly1;
> +	else if (strcmp(update, "nodes")==0)
> +		/* Just a placeholder since no related member in mdp_superblock_1 */
> +		;
>  	else
>  		rv = -1;
>  
> @@ -2222,6 +2225,9 @@ static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update
>  	    free(new_name);
>  	    break;
>  	}
> +	case NodeNumUpdate:
> +	    bms->nodes = __cpu_to_le32(st->nodes);
> +	    break;
>  	case NoUpdate:
>  	default:
>  	    break;


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-29  1:36   ` NeilBrown
@ 2015-04-29  2:41     ` Goldwyn Rodrigues
  2015-04-30  2:51       ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Goldwyn Rodrigues @ 2015-04-29  2:41 UTC (permalink / raw)
  To: NeilBrown, gqjiang; +Cc: linux-raid



On 04/28/2015 08:36 PM, NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:
>
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> For a clustered MD, create bitmaps equal to number of nodes so
>> each node has an independent bitmap.
>>
>> Only the first bitmap is has the bits set so that the first node
>> that assembles the device also performs the sync.
>>
>> The bitmaps are aligned to 4k boundaries.
>>
>> On-disk format:
>>
>> 0                    4k                     8k                    12k
>> -------------------------------------------------------------------
>> | idle                | md super            | bm super [0] + bits |
>> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
>> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
>> | bm bits [3, contd]  |                     |                     |
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>   Create.c   |  3 ++-
>>   bitmap.h   |  7 +++++--
>>   mdadm.8.in |  7 ++++++-
>>   mdadm.c    | 17 ++++++++++++++++-
>>   super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
>>   5 files changed, 70 insertions(+), 23 deletions(-)
>>
>> diff --git a/Create.c b/Create.c
>> index cd5485b..9663dc4 100644
>> --- a/Create.c
>> +++ b/Create.c
>> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
>>   #endif
>>   	}
>>
>> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
>> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
>> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
>>   		if ((vers%100) < 2) {
>>   			pr_err("internal bitmaps not supported by this kernel.\n");
>>   			goto abort_locked;
>> diff --git a/bitmap.h b/bitmap.h
>> index c8725a3..adbf0b4 100644
>> --- a/bitmap.h
>> +++ b/bitmap.h
>> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
>>   	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
>>   	__u32 daemon_sleep; /* 56  seconds between disk flushes */
>>   	__u32 write_behind; /* 60  number of outstanding write-behind writes */
>> -
>> -	__u8  pad[256 - 64]; /* set to zero */
>> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
>> +				 * reserved for the bitmap. */
>> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
>> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
>> +	__u8  pad[256 - 136]; /* set to zero */
>>   } bitmap_super_t;
>>
>>   /* notes:
>> diff --git a/mdadm.8.in b/mdadm.8.in
>> index a0e8288..c015cbf 100644
>> --- a/mdadm.8.in
>> +++ b/mdadm.8.in
>> @@ -700,7 +700,12 @@ and so is replicated on all devices.  If the word
>>   .B "none"
>>   is given with
>>   .B \-\-grow
>> -mode, then any bitmap that is present is removed.
>> +mode, then any bitmap that is present is removed. If the word
>> +.B "clustered"
>> +is given, the array is created for a clustered environment. One bitmap
>> +is created for each node as defined by the
>> +.B \-\-nodes
>> +parameter and are stored internally.
>>
>>   To help catch typing errors, the filename must contain at least one
>>   slash ('/') if it is a real file (not 'internal' or 'none').
>> diff --git a/mdadm.c b/mdadm.c
>> index e4f8568..6963a09 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -1111,6 +1111,15 @@ int main(int argc, char *argv[])
>>   				s.bitmap_file = optarg;
>>   				continue;
>>   			}
>> +			if (strcmp(optarg, "clustered")== 0) {
>> +				s.bitmap_file = optarg;
>> +				/* Set the default number of cluster nodes
>> +				 * to 4 if not already set by user
>> +				 */
>> +				if (c.nodes < 1)
>> +					c.nodes = 4;
>> +				continue;
>> +			}
>>   			/* probable typo */
>>   			pr_err("bitmap file must contain a '/', or be 'internal', or 'none'\n"
>>   				"       not '%s'\n", optarg);
>> @@ -1404,7 +1413,13 @@ int main(int argc, char *argv[])
>>   		if (c.delay == 0)
>>   			c.delay = DEFAULT_BITMAP_DELAY;
>>
>> -		if (!strncmp(s.bitmap_file, "internal", 9) ||
>> +		if (!strncmp(s.bitmap_file, "clustered", 9)) {
>> +			if (s.level != 1) {
>> +				pr_err("--bitmap=clustered is currently supported with RAID mirror only\n");
>> +				rv = 1;
>> +				break;
>> +			}
>> +		} else if (!strncmp(s.bitmap_file, "internal", 9) ||
>>   			!strncmp(s.bitmap_file,"none", 4)) {
>>   			if (c.nodes) {
>>   				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
>> diff --git a/super1.c b/super1.c
>> index f0508fe..ac1b011 100644
>> --- a/super1.c
>> +++ b/super1.c
>> @@ -2144,6 +2144,10 @@ add_internal_bitmap1(struct supertype *st,
>>   	bms->daemon_sleep = __cpu_to_le32(delay);
>>   	bms->sync_size = __cpu_to_le64(size);
>>   	bms->write_behind = __cpu_to_le32(write_behind);
>> +	bms->nodes = __cpu_to_le32(st->nodes);
>> +	if (st->cluster_name)
>> +		strncpy((char *)bms->cluster_name,
>> +				st->cluster_name, strlen(st->cluster_name));
>>
>>   	*chunkp = chunk;
>>   	return 1;
>> @@ -2177,6 +2181,7 @@ static int write_bitmap1(struct supertype *st, int fd)
>>   	void *buf;
>>   	int towrite, n;
>>   	struct align_fd afd;
>> +	unsigned int i;
>>
>>   	init_afd(&afd, fd);
>>
>> @@ -2185,27 +2190,45 @@ static int write_bitmap1(struct supertype *st, int fd)
>>   	if (posix_memalign(&buf, 4096, 4096))
>>   		return -ENOMEM;
>>
>> -	memset(buf, 0xff, 4096);
>> -	memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
>> -
>> -	towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
>> -	towrite = (towrite+7) >> 3; /* bits to bytes */
>> -	towrite += sizeof(bitmap_super_t);
>> -	towrite = ROUND_UP(towrite, 512);
>> -	while (towrite > 0) {
>> -		n = towrite;
>> -		if (n > 4096)
>> -			n = 4096;
>> -		n = awrite(&afd, buf, n);
>> -		if (n > 0)
>> -			towrite -= n;
>> +	/* We use bms->nodes as opposed to st->nodes to
>> +	 * be compatible with write-after-reads such as
>> +	 * the GROW operation.
>> +	 */
>> +	for (i = 0; i < __le32_to_cpu(bms->nodes); i++) {
>> +		/* Only the first bitmap should resync
>> +		 * the whole device
>> +		 */
>> +		if (i)
>> +			memset(buf, 0x00, 4096);
>>   		else
>> +			memset(buf, 0xff, 4096);
>
> Why is the first bitmap initialised to 0x00 and the others to 0xff?
> If there is a good reason it should be documented either in a comment in the
> code or in the changelog entry.


Rather, it is the reverse. The first one is initialized to 0xff and the 
rest are set to 0x00.

The reason is only the first node to assemble the device should perform 
the resync (if --assume-clean is not provided). The comment is right 
above the code. Perhaps I should be more elaborate with the comment.


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 01/10] Add nodes option while creating md
  2015-04-29  1:30   ` NeilBrown
@ 2015-04-30  2:33     ` Guoqing Jiang
  0 siblings, 0 replies; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  2:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:32 +0800 gqjiang@suse.com wrote:
>
>   
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> Specifies the maximum number of nodes in the cluster that may use
>> this device simultaneously. This is equivalent to the number of
>> bitmaps created in the internal superblock (patches to follow).
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>     
>
> This doesn't really make much sense coming first in the series.  It sets up a
> value that is never used.
>
> I would much rather 03/10 came first - which would just create 4 bitmaps.
> Then have this patch to allow that '4' to be changed.
>
>
>   
Ok, I will re-arrange them, make 03 first, and keep the others with
original sequence.
>> ---
>>  Create.c   |  1 +
>>  ReadMe.c   |  1 +
>>  mdadm.8.in |  5 +++++
>>  mdadm.c    | 20 +++++++++++++++++++-
>>  mdadm.h    |  3 +++
>>  5 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/Create.c b/Create.c
>> index ef28da0..b73f6cb 100644
>> --- a/Create.c
>> +++ b/Create.c
>> @@ -531,6 +531,7 @@ int Create(struct supertype *st, char *mddev,
>>  				st->ss->name);
>>  		warn = 1;
>>  	}
>> +	st->nodes = c->nodes;
>>  
>>  	if (warn) {
>>  		if (c->runstop!= 1) {
>> diff --git a/ReadMe.c b/ReadMe.c
>> index 87a4916..30c569d 100644
>> --- a/ReadMe.c
>> +++ b/ReadMe.c
>> @@ -140,6 +140,7 @@ struct option long_options[] = {
>>      {"homehost",  1, 0,  HomeHost},
>>      {"symlinks",  1, 0,  Symlinks},
>>      {"data-offset",1, 0, DataOffset},
>> +    {"nodes",1, 0, Nodes},
>>  
>>      /* For assemble */
>>      {"uuid",      1, 0, 'u'},
>> diff --git a/mdadm.8.in b/mdadm.8.in
>> index a630310..bd8d59e 100644
>> --- a/mdadm.8.in
>> +++ b/mdadm.8.in
>> @@ -966,6 +966,11 @@ However for RAID0, it is not possible to add spares.  So to increase
>>  the number of devices in a RAID0, it is necessary to set the new
>>  number of devices, and to add the new devices, in the same command.
>>  
>> +.TP
>> +.BR \-\-nodes
>> +Specify the maximum number of nodes in the cluster that will use this
>> +device simultaneously. If not specified, this defaults to 4.
>> +
>>     
>
> I think this should mention that it only makes sense with --bitmap=cluster
> (which of course isn't available at this point....)
>
>
>   
Will add it.
>>  .SH For assemble:
>>  
>>  .TP
>> diff --git a/mdadm.c b/mdadm.c
>> index 3e8c49b..bce6a76 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -588,7 +588,14 @@ int main(int argc, char *argv[])
>>  			}
>>  			ident.raid_disks = s.raiddisks;
>>  			continue;
>> -
>> +		case O(CREATE, Nodes):
>> +			c.nodes = parse_num(optarg);
>> +			if (c.nodes <= 0) {
>> +				pr_err("invalid number for the number of "
>> +						"cluster nodes: %s\n", optarg);
>> +				exit(2);
>> +			}
>> +			continue;
>>  		case O(CREATE,'x'): /* number of spare (eXtra) disks */
>>  			if (s.sparedisks) {
>>  				pr_err("spare-devices set twice: %d and %s\n",
>> @@ -1377,6 +1384,17 @@ int main(int argc, char *argv[])
>>  	case CREATE:
>>  		if (c.delay == 0)
>>  			c.delay = DEFAULT_BITMAP_DELAY;
>> +
>> +		if (!strncmp(s.bitmap_file, "internal", 9) ||
>> +			!strncmp(s.bitmap_file,"none", 4)) {
>>     
>
> I'm sorry but I absolutely *hate* this construct.
> The '!' at the front makes it seem like you are testing that the string does
> *not* have that value, but it is exactly the reverse.
> And why "strncmp" rather than "strcmp" ??
>
> Please use
>
>     if (strcmp(s.bitmap_file, "internal") == 0 ||
>         strcmp(s.bitmap_file, "none) == 0) {
>
> except.... what about if bitmap_file in /path/to/somewhere.  Presumably you
> want to exclude that case too.
> May be
>    if (strcmp(s.bitmap_file, "cluster") != 0) {
> ??
>   
Thanks,  which is better for understanding.

Regards,
Guoqing

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-29  1:41   ` NeilBrown
@ 2015-04-30  2:44     ` Guoqing Jiang
  2015-04-30  2:53       ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  2:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:
>
>   
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> For a clustered MD, create bitmaps equal to number of nodes so
>> each node has an independent bitmap.
>>
>> Only the first bitmap is has the bits set so that the first node
>> that assembles the device also performs the sync.
>>
>> The bitmaps are aligned to 4k boundaries.
>>
>> On-disk format:
>>
>> 0                    4k                     8k                    12k
>> -------------------------------------------------------------------
>> | idle                | md super            | bm super [0] + bits |
>> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
>> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
>> | bm bits [3, contd]  |                     |                     |
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  Create.c   |  3 ++-
>>  bitmap.h   |  7 +++++--
>>  mdadm.8.in |  7 ++++++-
>>  mdadm.c    | 17 ++++++++++++++++-
>>  super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
>>  5 files changed, 70 insertions(+), 23 deletions(-)
>>
>> diff --git a/Create.c b/Create.c
>> index cd5485b..9663dc4 100644
>> --- a/Create.c
>> +++ b/Create.c
>> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
>>  #endif
>>  	}
>>  
>> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
>> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
>> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
>>  		if ((vers%100) < 2) {
>>  			pr_err("internal bitmaps not supported by this kernel.\n");
>>  			goto abort_locked;
>> diff --git a/bitmap.h b/bitmap.h
>> index c8725a3..adbf0b4 100644
>> --- a/bitmap.h
>> +++ b/bitmap.h
>> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
>>  	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
>>  	__u32 daemon_sleep; /* 56  seconds between disk flushes */
>>  	__u32 write_behind; /* 60  number of outstanding write-behind writes */
>> -
>> -	__u8  pad[256 - 64]; /* set to zero */
>> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
>> +				 * reserved for the bitmap. */
>> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
>> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
>> +	__u8  pad[256 - 136]; /* set to zero */
>>  } bitmap_super_t;
>>     
>
> I missed this the first time, but these fields that you have added need to be
> added to sb_le_to_cpu().
>
>   
I guess only nodes need to be added as follows:
    sb->nodes = __le32_to_cpu(sb->nodes);

Does the cluster_name need to be addressed too? Like.
    for (i = 0; i < 64; i++)
       sb->cluster_name[i] = (unsigned
char)__le16_to_cpu(sb->cluster_name[i]);

Thanks,
Guoqing
> NeilBrown
>   


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-29  2:41     ` Goldwyn Rodrigues
@ 2015-04-30  2:51       ` NeilBrown
  2015-04-30 12:44         ` Goldwyn Rodrigues
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-30  2:51 UTC (permalink / raw)
  To: Goldwyn Rodrigues; +Cc: gqjiang, linux-raid

[-- Attachment #1: Type: text/plain, Size: 7995 bytes --]

On Tue, 28 Apr 2015 21:41:47 -0500 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:

> 
> 
> On 04/28/2015 08:36 PM, NeilBrown wrote:
> > On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:
> >
> >> From: Guoqing Jiang <gqjiang@suse.com>
> >>
> >> For a clustered MD, create bitmaps equal to number of nodes so
> >> each node has an independent bitmap.
> >>
> >> Only the first bitmap is has the bits set so that the first node
> >> that assembles the device also performs the sync.
> >>
> >> The bitmaps are aligned to 4k boundaries.
> >>
> >> On-disk format:
> >>
> >> 0                    4k                     8k                    12k
> >> -------------------------------------------------------------------
> >> | idle                | md super            | bm super [0] + bits |
> >> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
> >> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
> >> | bm bits [3, contd]  |                     |                     |
> >>
> >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >> ---
> >>   Create.c   |  3 ++-
> >>   bitmap.h   |  7 +++++--
> >>   mdadm.8.in |  7 ++++++-
> >>   mdadm.c    | 17 ++++++++++++++++-
> >>   super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
> >>   5 files changed, 70 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/Create.c b/Create.c
> >> index cd5485b..9663dc4 100644
> >> --- a/Create.c
> >> +++ b/Create.c
> >> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
> >>   #endif
> >>   	}
> >>
> >> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
> >> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
> >> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
> >>   		if ((vers%100) < 2) {
> >>   			pr_err("internal bitmaps not supported by this kernel.\n");
> >>   			goto abort_locked;
> >> diff --git a/bitmap.h b/bitmap.h
> >> index c8725a3..adbf0b4 100644
> >> --- a/bitmap.h
> >> +++ b/bitmap.h
> >> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
> >>   	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
> >>   	__u32 daemon_sleep; /* 56  seconds between disk flushes */
> >>   	__u32 write_behind; /* 60  number of outstanding write-behind writes */
> >> -
> >> -	__u8  pad[256 - 64]; /* set to zero */
> >> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
> >> +				 * reserved for the bitmap. */
> >> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
> >> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> >> +	__u8  pad[256 - 136]; /* set to zero */
> >>   } bitmap_super_t;
> >>
> >>   /* notes:
> >> diff --git a/mdadm.8.in b/mdadm.8.in
> >> index a0e8288..c015cbf 100644
> >> --- a/mdadm.8.in
> >> +++ b/mdadm.8.in
> >> @@ -700,7 +700,12 @@ and so is replicated on all devices.  If the word
> >>   .B "none"
> >>   is given with
> >>   .B \-\-grow
> >> -mode, then any bitmap that is present is removed.
> >> +mode, then any bitmap that is present is removed. If the word
> >> +.B "clustered"
> >> +is given, the array is created for a clustered environment. One bitmap
> >> +is created for each node as defined by the
> >> +.B \-\-nodes
> >> +parameter and are stored internally.
> >>
> >>   To help catch typing errors, the filename must contain at least one
> >>   slash ('/') if it is a real file (not 'internal' or 'none').
> >> diff --git a/mdadm.c b/mdadm.c
> >> index e4f8568..6963a09 100644
> >> --- a/mdadm.c
> >> +++ b/mdadm.c
> >> @@ -1111,6 +1111,15 @@ int main(int argc, char *argv[])
> >>   				s.bitmap_file = optarg;
> >>   				continue;
> >>   			}
> >> +			if (strcmp(optarg, "clustered")== 0) {
> >> +				s.bitmap_file = optarg;
> >> +				/* Set the default number of cluster nodes
> >> +				 * to 4 if not already set by user
> >> +				 */
> >> +				if (c.nodes < 1)
> >> +					c.nodes = 4;
> >> +				continue;
> >> +			}
> >>   			/* probable typo */
> >>   			pr_err("bitmap file must contain a '/', or be 'internal', or 'none'\n"
> >>   				"       not '%s'\n", optarg);
> >> @@ -1404,7 +1413,13 @@ int main(int argc, char *argv[])
> >>   		if (c.delay == 0)
> >>   			c.delay = DEFAULT_BITMAP_DELAY;
> >>
> >> -		if (!strncmp(s.bitmap_file, "internal", 9) ||
> >> +		if (!strncmp(s.bitmap_file, "clustered", 9)) {
> >> +			if (s.level != 1) {
> >> +				pr_err("--bitmap=clustered is currently supported with RAID mirror only\n");
> >> +				rv = 1;
> >> +				break;
> >> +			}
> >> +		} else if (!strncmp(s.bitmap_file, "internal", 9) ||
> >>   			!strncmp(s.bitmap_file,"none", 4)) {
> >>   			if (c.nodes) {
> >>   				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
> >> diff --git a/super1.c b/super1.c
> >> index f0508fe..ac1b011 100644
> >> --- a/super1.c
> >> +++ b/super1.c
> >> @@ -2144,6 +2144,10 @@ add_internal_bitmap1(struct supertype *st,
> >>   	bms->daemon_sleep = __cpu_to_le32(delay);
> >>   	bms->sync_size = __cpu_to_le64(size);
> >>   	bms->write_behind = __cpu_to_le32(write_behind);
> >> +	bms->nodes = __cpu_to_le32(st->nodes);
> >> +	if (st->cluster_name)
> >> +		strncpy((char *)bms->cluster_name,
> >> +				st->cluster_name, strlen(st->cluster_name));
> >>
> >>   	*chunkp = chunk;
> >>   	return 1;
> >> @@ -2177,6 +2181,7 @@ static int write_bitmap1(struct supertype *st, int fd)
> >>   	void *buf;
> >>   	int towrite, n;
> >>   	struct align_fd afd;
> >> +	unsigned int i;
> >>
> >>   	init_afd(&afd, fd);
> >>
> >> @@ -2185,27 +2190,45 @@ static int write_bitmap1(struct supertype *st, int fd)
> >>   	if (posix_memalign(&buf, 4096, 4096))
> >>   		return -ENOMEM;
> >>
> >> -	memset(buf, 0xff, 4096);
> >> -	memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
> >> -
> >> -	towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
> >> -	towrite = (towrite+7) >> 3; /* bits to bytes */
> >> -	towrite += sizeof(bitmap_super_t);
> >> -	towrite = ROUND_UP(towrite, 512);
> >> -	while (towrite > 0) {
> >> -		n = towrite;
> >> -		if (n > 4096)
> >> -			n = 4096;
> >> -		n = awrite(&afd, buf, n);
> >> -		if (n > 0)
> >> -			towrite -= n;
> >> +	/* We use bms->nodes as opposed to st->nodes to
> >> +	 * be compatible with write-after-reads such as
> >> +	 * the GROW operation.
> >> +	 */
> >> +	for (i = 0; i < __le32_to_cpu(bms->nodes); i++) {
> >> +		/* Only the first bitmap should resync
> >> +		 * the whole device
> >> +		 */
> >> +		if (i)
> >> +			memset(buf, 0x00, 4096);
> >>   		else
> >> +			memset(buf, 0xff, 4096);
> >
> > Why is the first bitmap initialised to 0x00 and the others to 0xff?
> > If there is a good reason it should be documented either in a comment in the
> > code or in the changelog entry.
> 
> 
> Rather, it is the reverse. The first one is initialized to 0xff and the 
> rest are set to 0x00.
> 
> The reason is only the first node to assemble the device should perform 
> the resync (if --assume-clean is not provided). The comment is right 
> above the code. Perhaps I should be more elaborate with the comment.
> 
> 

Hmmm... Perhaps I should read code with my eyes open!

Not sure I agree though.  Why should the first node be special?
What if node '0' doesn't get activated?
I guess it always well because of the way numbers are assigned, but I'm not
feeling very comfortable...

Thinking a bit more ... why do we set any bits to '1'?
Why not just set BITMAP_STALE, and let the kernel figure things out.

For the single-node case, BITMAP_STALE is the same as setting all the bits to
one.
For the cluster case, we can get BITMAP_STALE to do whatever we want. and we
should make sure we handle it correctly anyway.

So maybe mdadm should set BITMAP_STALE, and leave all the bits as 0.

Thoughts?

NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-30  2:44     ` Guoqing Jiang
@ 2015-04-30  2:53       ` NeilBrown
  0 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2015-04-30  2:53 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 3608 bytes --]

On Thu, 30 Apr 2015 10:44:23 +0800 Guoqing Jiang <gqJiang@suse.com> wrote:

> NeilBrown wrote:
> > On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:
> >
> >   
> >> From: Guoqing Jiang <gqjiang@suse.com>
> >>
> >> For a clustered MD, create bitmaps equal to number of nodes so
> >> each node has an independent bitmap.
> >>
> >> Only the first bitmap is has the bits set so that the first node
> >> that assembles the device also performs the sync.
> >>
> >> The bitmaps are aligned to 4k boundaries.
> >>
> >> On-disk format:
> >>
> >> 0                    4k                     8k                    12k
> >> -------------------------------------------------------------------
> >> | idle                | md super            | bm super [0] + bits |
> >> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
> >> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
> >> | bm bits [3, contd]  |                     |                     |
> >>
> >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >> ---
> >>  Create.c   |  3 ++-
> >>  bitmap.h   |  7 +++++--
> >>  mdadm.8.in |  7 ++++++-
> >>  mdadm.c    | 17 ++++++++++++++++-
> >>  super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
> >>  5 files changed, 70 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/Create.c b/Create.c
> >> index cd5485b..9663dc4 100644
> >> --- a/Create.c
> >> +++ b/Create.c
> >> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
> >>  #endif
> >>  	}
> >>  
> >> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
> >> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
> >> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
> >>  		if ((vers%100) < 2) {
> >>  			pr_err("internal bitmaps not supported by this kernel.\n");
> >>  			goto abort_locked;
> >> diff --git a/bitmap.h b/bitmap.h
> >> index c8725a3..adbf0b4 100644
> >> --- a/bitmap.h
> >> +++ b/bitmap.h
> >> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
> >>  	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
> >>  	__u32 daemon_sleep; /* 56  seconds between disk flushes */
> >>  	__u32 write_behind; /* 60  number of outstanding write-behind writes */
> >> -
> >> -	__u8  pad[256 - 64]; /* set to zero */
> >> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
> >> +				 * reserved for the bitmap. */
> >> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
> >> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
> >> +	__u8  pad[256 - 136]; /* set to zero */
> >>  } bitmap_super_t;
> >>     
> >
> > I missed this the first time, but these fields that you have added need to be
> > added to sb_le_to_cpu().
> >
> >   
> I guess only nodes need to be added as follows:
>     sb->nodes = __le32_to_cpu(sb->nodes);

Why not "sectors_reserved".  It may not be used, but best to be consistent.

> 
> Does the cluster_name need to be addressed too? Like.
>     for (i = 0; i < 64; i++)
>        sb->cluster_name[i] = (unsigned
> char)__le16_to_cpu(sb->cluster_name[i]);

No, cluster_name is __u8, and they aren't affected by CPU endian-ness.

Thanks,
NeilBrown


> 
> Thanks,
> Guoqing
> > NeilBrown
> >   
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/10] Show all bitmaps while examining bitmap
  2015-04-29  1:41   ` NeilBrown
@ 2015-04-30  3:17     ` Guoqing Jiang
  2015-04-30  4:45       ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  3:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:35 +0800 gqjiang@suse.com wrote:
>
>   
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> This adds capability of exmining bitmaps corresponding to all
>> nodes/slots on the device.
>>
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  bitmap.c | 35 ++++++++++++++++++++++++++---------
>>  1 file changed, 26 insertions(+), 9 deletions(-)
>>
>> diff --git a/bitmap.c b/bitmap.c
>> index b1d54a6..ab83f4e 100644
>> --- a/bitmap.c
>> +++ b/bitmap.c
>> @@ -258,7 +258,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
>>  	int rv = 1;
>>  	char buf[64];
>>  	int swap;
>> -	int fd;
>> +	int fd, i;
>>  	__u32 uuid32[4];
>>  
>>  	fd = bitmap_file_open(filename, &st);
>> @@ -315,9 +315,6 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
>>  		       uuid32[2],
>>  		       uuid32[3]);
>>  
>> -	printf("          Events : %llu\n", (unsigned long long)sb->events);
>> -	printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
>> -	printf("           State : %s\n", bitmap_state(sb->state));
>>     
>
> Can we please leave this where it is in the case where there is only one node?
> Only move it down if there are multiple nodes to report on.
>
>   
Yes, it could be possible. Just to make sure understand you correctly,
what about the below?

if (sb->nodes == 0) {
    /* code for more than one node */
} else {
   /* original code */
}

Thanks,
Guoqing




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/10] Add a new clustered disk
  2015-04-29  1:45   ` NeilBrown
@ 2015-04-30  3:20     ` Guoqing Jiang
  0 siblings, 0 replies; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  3:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: gqjiang, linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:36 +0800 gqjiang@suse.com wrote:
>
>   
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> A clustered disk is added by the traditional --add sequence.
>> However, other nodes need to acknowledge that they can "see"
>> the device. This is done by --cluster-confirm:
>>
>> --cluster-confirm Y:/dev/whatever (if disk is found)
>> or
>> --cluster-confirm Y:missing (if disk is not found)
>>
>> The node initiating the --add, has the disk state tagged with
>> MD_DISK_CLUSTER_ADD and the one confirming tag the disk with
>> MD_DISK_CANDIDATE.
>>     
>
> You haven't explained 'Y' here.  It looks like it means 'Yes', but it doesn't.
>
>
>   
Right, actually 'Y' stands for the slot, will modify it.
>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>
>>     
[snip]
>>  
>> +int parse_cluster_confirm_arg(char *input, char **devname, int *slot)
>> +{
>> +	char *dev;
>> +	*slot = strtoul(input, &dev, 10);
>> +	if (dev[0] == ':')
>> +		*devname = dev+1;
>> +	else
>> +		return -1;
>> +	return 0;
>> +}
>>     
>
> The logic here hurts my brain :-(
>
>  *slot = strtoul(input, &dev, 10);
>  if (dev == input || dev[0] != ':')
>      return -1;
>  *devname = dev+1;
>  return 0;
>
>   
Thanks for above, :)

Regards,
Guoqing

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 08/10] mdadm: add the ability to change cluster name
  2015-04-29  1:50   ` NeilBrown
@ 2015-04-30  3:22     ` Guoqing Jiang
  0 siblings, 0 replies; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  3:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: gqjiang, linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:39 +0800 gqjiang@suse.com wrote:
>
>   
>> --- a/super1.c
>> +++ b/super1.c
>> @@ -1073,6 +1073,25 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>>  		info->name[32] = 0;
>>  	}
>>  
>> +	if (strcmp(update, "home-cluster") == 0 &&
>> +	    homehost) {
>> +		/* Note that 'home-cluster' is to change the name of cluster,
>> +		 * it is another "name" update.
>> +		 */
>> +		char *new_name = xmalloc(sizeof(sb->set_name));
>> +		if (strrchr(sb->set_name, ':')) {
>> +			strcpy(new_name, strchr(sb->set_name, ':'));
>> +		}
>> +
>> +		memset(sb->set_name, 0, sizeof(sb->set_name));
>> +		strcpy(sb->set_name, homehost);
>> +		if (new_name)
>> +			strcat(sb->set_name, new_name);
>> +
>> +		free(new_name);
>> +		goto out;
>> +	}
>> +
>>     
>
> Please get rid of the 'goto out' and put an 'else' in here.
>
> "homehost" is special because it translates to "name".
> "home-cluster" is not special.
>
>   
Got it.
>   
>>  	if (strcmp(update, "force-one")==0) {
>>  		/* Not enough devices for a working array,
>>  		 * so bring this one up-to-date
>> @@ -1313,6 +1332,7 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
>>  	else
>>  		rv = -1;
>>  
>> +out:
>>  	sb->sb_csum = calc_sb_1_csum(sb);
>>  	return rv;
>>  }
>> @@ -1691,7 +1711,7 @@ static int write_init_super1(struct supertype *st)
>>  		sb->sb_csum = calc_sb_1_csum(sb);
>>  		rv = store_super1(st, di->fd);
>>  		if (rv == 0 && (__le32_to_cpu(sb->feature_map) & 1))
>> -			rv = st->ss->write_bitmap(st, di->fd);
>> +			rv = st->ss->write_bitmap(st, di->fd, NoUpdate);
>>  		close(di->fd);
>>  		di->fd = -1;
>>  		if (rv)
>> @@ -2175,7 +2195,7 @@ static void locate_bitmap1(struct supertype *st, int fd)
>>  	lseek64(fd, offset<<9, 0);
>>  }
>>  
>> -static int write_bitmap1(struct supertype *st, int fd)
>> +static int write_bitmap1(struct supertype *st, int fd, enum bitmap_update update)
>>  {
>>  	struct mdp_superblock_1 *sb = st->sb;
>>  	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb)+MAX_SB_SIZE);
>> @@ -2185,6 +2205,28 @@ static int write_bitmap1(struct supertype *st, int fd)
>>  	struct align_fd afd;
>>  	unsigned int i;
>>  
>> +	switch (update) {
>> +	case NameUpdate:
>> +	{
>> +	    char *new_name = xmalloc(sizeof(sb->set_name));
>> +
>> +	    strncpy(new_name, sb->set_name, sizeof(sb->set_name));
>> +	    memset((char *)bms->cluster_name, 0, sizeof(bms->cluster_name));
>> +
>> +	    if (strtok(new_name, ":"))
>> +		strncpy((char *)bms->cluster_name, new_name, strlen(sb->set_name));
>> +	    else
>> +		/* In case the original set_name doesn't like aaa:md* */
>> +		strncpy((char *)bms->cluster_name, sb->set_name, strlen(sb->set_name));
>> +
>> +	    free(new_name);
>> +	    break;
>> +	}
>>     
>
> I don't like the braces there - too confusing.
> Just make 'new_name' a top-level variable and get rid of the {}
>
>   
No problem, will do it.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 09/10] mdadm: change the num of cluster node
  2015-04-29  1:51   ` NeilBrown
@ 2015-04-30  3:34     ` Guoqing Jiang
  2015-04-30  6:47       ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30  3:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: gqjiang, linux-raid, rgoldwyn

NeilBrown wrote:
> On Fri, 24 Apr 2015 15:30:40 +0800 gqjiang@suse.com wrote:
>
>   
>> From: Guoqing Jiang <gqjiang@suse.com>
>>
>> This extends nodes option for assemble mode, make the num of
>> cluster node could be change by user.
>>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  Assemble.c | 4 ++++
>>  ReadMe.c   | 2 +-
>>  mdadm.c    | 3 +++
>>  mdadm.h    | 1 +
>>  super1.c   | 6 ++++++
>>  5 files changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/Assemble.c b/Assemble.c
>> index e1b846c..22042a9 100644
>> --- a/Assemble.c
>> +++ b/Assemble.c
>> @@ -649,6 +649,10 @@ static int load_devices(struct devs *devices, char *devmap,
>>  							    devname, 0, 0, c->homecluster);
>>  				tst->ss->write_bitmap(tst, dfd, NameUpdate);
>>  			}
>> +			if (strcmp(c->update, "nodes") == 0) {
>> +				tst->nodes = c->nodes;
>> +				tst->ss->write_bitmap(tst, dfd, NodeNumUpdate);
>> +			}
>>     
>
> Doesn't there need to be some test that there is enough free space on all
> devices to store the extra bitmaps (when nodes is increasing)??
>
>   
Agree, could you pls elaborate more about the test? I guess the test
need to be run
before write_bitmap1call the awrite(&afd, buf, n).

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/10] Show all bitmaps while examining bitmap
  2015-04-30  3:17     ` Guoqing Jiang
@ 2015-04-30  4:45       ` NeilBrown
  0 siblings, 0 replies; 31+ messages in thread
From: NeilBrown @ 2015-04-30  4:45 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 2395 bytes --]

On Thu, 30 Apr 2015 11:17:28 +0800 Guoqing Jiang <gqJiang@suse.com> wrote:

> NeilBrown wrote:
> > On Fri, 24 Apr 2015 15:30:35 +0800 gqjiang@suse.com wrote:
> >
> >   
> >> From: Guoqing Jiang <gqjiang@suse.com>
> >>
> >> This adds capability of exmining bitmaps corresponding to all
> >> nodes/slots on the device.
> >>
> >> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> >> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >> ---
> >>  bitmap.c | 35 ++++++++++++++++++++++++++---------
> >>  1 file changed, 26 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/bitmap.c b/bitmap.c
> >> index b1d54a6..ab83f4e 100644
> >> --- a/bitmap.c
> >> +++ b/bitmap.c
> >> @@ -258,7 +258,7 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
> >>  	int rv = 1;
> >>  	char buf[64];
> >>  	int swap;
> >> -	int fd;
> >> +	int fd, i;
> >>  	__u32 uuid32[4];
> >>  
> >>  	fd = bitmap_file_open(filename, &st);
> >> @@ -315,9 +315,6 @@ int ExamineBitmap(char *filename, int brief, struct supertype *st)
> >>  		       uuid32[2],
> >>  		       uuid32[3]);
> >>  
> >> -	printf("          Events : %llu\n", (unsigned long long)sb->events);
> >> -	printf("  Events Cleared : %llu\n", (unsigned long long)sb->events_cleared);
> >> -	printf("           State : %s\n", bitmap_state(sb->state));
> >>     
> >
> > Can we please leave this where it is in the case where there is only one node?
> > Only move it down if there are multiple nodes to report on.
> >
> >   
> Yes, it could be possible. Just to make sure understand you correctly,
> what about the below?
> 
> if (sb->nodes == 0) {
>     /* code for more than one node */
> } else {
>    /* original code */
> }
> 

Something a bit like that, though !=0 of course.

The original code put some info at the top.  When there are multiple nodes
that makes more sense  at the bottom.
So:

 if (sb->nodes == 0) {
    original code that will be moved
 }
 more original code that isn't changing
 if (sb->nodes == 0) {
    other original code
 } else {
    new code
 }

or something like that.

Thanks,
NeilBrown


> Thanks,
> Guoqing
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 09/10] mdadm: change the num of cluster node
  2015-04-30  3:34     ` Guoqing Jiang
@ 2015-04-30  6:47       ` NeilBrown
  2015-04-30 10:04         ` Guoqing Jiang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2015-04-30  6:47 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: gqjiang, linux-raid, rgoldwyn

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On Thu, 30 Apr 2015 11:34:05 +0800 Guoqing Jiang <jgq516@gmail.com> wrote:

> NeilBrown wrote:
> > On Fri, 24 Apr 2015 15:30:40 +0800 gqjiang@suse.com wrote:
> >
> >   
> >> From: Guoqing Jiang <gqjiang@suse.com>
> >>
> >> This extends nodes option for assemble mode, make the num of
> >> cluster node could be change by user.
> >>
> >> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >> ---
> >>  Assemble.c | 4 ++++
> >>  ReadMe.c   | 2 +-
> >>  mdadm.c    | 3 +++
> >>  mdadm.h    | 1 +
> >>  super1.c   | 6 ++++++
> >>  5 files changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/Assemble.c b/Assemble.c
> >> index e1b846c..22042a9 100644
> >> --- a/Assemble.c
> >> +++ b/Assemble.c
> >> @@ -649,6 +649,10 @@ static int load_devices(struct devs *devices, char *devmap,
> >>  							    devname, 0, 0, c->homecluster);
> >>  				tst->ss->write_bitmap(tst, dfd, NameUpdate);
> >>  			}
> >> +			if (strcmp(c->update, "nodes") == 0) {
> >> +				tst->nodes = c->nodes;
> >> +				tst->ss->write_bitmap(tst, dfd, NodeNumUpdate);
> >> +			}
> >>     
> >
> > Doesn't there need to be some test that there is enough free space on all
> > devices to store the extra bitmaps (when nodes is increasing)??
> >
> >   
> Agree, could you pls elaborate more about the test? I guess the test
> need to be run
> before write_bitmap1call the awrite(&afd, buf, n).

Normally we don't increase the size of a bitmap while it is alive.
When an array is reshaped, bitmap_resize in the kernel actually changes the
bitmap chunk size if necessary so that the bitmap will fit in the available
space. 
To make the bitmap larger, the only current approach is to delete the bitmap
and add a new one.  This is handled by add_internal_bitmap1 in mdadm.
So make the bitmap bigger you will need to perform similar calculations taht
add_internal_bitmap1 performs, though instead of choosing a size or
chunksize, you see need to check if the space is sufficient.

Hope that helps,
NeilBrown

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 09/10] mdadm: change the num of cluster node
  2015-04-30  6:47       ` NeilBrown
@ 2015-04-30 10:04         ` Guoqing Jiang
  0 siblings, 0 replies; 31+ messages in thread
From: Guoqing Jiang @ 2015-04-30 10:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Guoqing Jiang, linux-raid, rgoldwyn

NeilBrown wrote:
> On Thu, 30 Apr 2015 11:34:05 +0800 Guoqing Jiang <jgq516@gmail.com> wrote:
>
>   
>> NeilBrown wrote:
>>     
>>> On Fri, 24 Apr 2015 15:30:40 +0800 gqjiang@suse.com wrote:
>>>
>>>   
>>>       
>>>> From: Guoqing Jiang <gqjiang@suse.com>
>>>>
>>>> This extends nodes option for assemble mode, make the num of
>>>> cluster node could be change by user.
>>>>
>>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>  Assemble.c | 4 ++++
>>>>  ReadMe.c   | 2 +-
>>>>  mdadm.c    | 3 +++
>>>>  mdadm.h    | 1 +
>>>>  super1.c   | 6 ++++++
>>>>  5 files changed, 15 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Assemble.c b/Assemble.c
>>>> index e1b846c..22042a9 100644
>>>> --- a/Assemble.c
>>>> +++ b/Assemble.c
>>>> @@ -649,6 +649,10 @@ static int load_devices(struct devs *devices, char *devmap,
>>>>  							    devname, 0, 0, c->homecluster);
>>>>  				tst->ss->write_bitmap(tst, dfd, NameUpdate);
>>>>  			}
>>>> +			if (strcmp(c->update, "nodes") == 0) {
>>>> +				tst->nodes = c->nodes;
>>>> +				tst->ss->write_bitmap(tst, dfd, NodeNumUpdate);
>>>> +			}
>>>>     
>>>>         
>>> Doesn't there need to be some test that there is enough free space on all
>>> devices to store the extra bitmaps (when nodes is increasing)??
>>>
>>>   
>>>       
>> Agree, could you pls elaborate more about the test? I guess the test
>> need to be run
>> before write_bitmap1call the awrite(&afd, buf, n).
>>     
>
> Normally we don't increase the size of a bitmap while it is alive.
> When an array is reshaped, bitmap_resize in the kernel actually changes the
> bitmap chunk size if necessary so that the bitmap will fit in the available
> space. 
> To make the bitmap larger, the only current approach is to delete the bitmap
> and add a new one.  This is handled by add_internal_bitmap1 in mdadm.
> So make the bitmap bigger you will need to perform similar calculations taht
> add_internal_bitmap1 performs, though instead of choosing a size or
> chunksize, you see need to check if the space is sufficient.
>
>   
Thanks a lot for the detailed infos. Seems the size of bitmap is one of
4k, 64k and
128k (per choose_bm_space), is it ok to call choose_bm_space to get
current bitmap
size? And since the size of each bitmap can be get by bitmap_bits, then
we can
know the current size of bitmap is enough or not by something like

int room = choose_bm_space(__le64_to_cpu(sb->size));
int total_bm_size_byte = room / 2 * 1024;
if (total_bm_size_byte > nodenums * bitmap_bits() / 8) {
/* enough space */
continue the write_bitmap
} else {
/* no enough space*/
Just warn it.
}

I think we should not allocate more space for extra nodes if current
size is not bigger
than required, otherwise it is possible to violate the meaning of
choose_bm_space.
Please correct me if I misunderstood something.

Thanks,
Guoqing

> Hope that helps,
> NeilBrown
>   


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/10] Create n bitmaps for clustered mode
  2015-04-30  2:51       ` NeilBrown
@ 2015-04-30 12:44         ` Goldwyn Rodrigues
  0 siblings, 0 replies; 31+ messages in thread
From: Goldwyn Rodrigues @ 2015-04-30 12:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: gqjiang, linux-raid



On 04/29/2015 09:51 PM, NeilBrown wrote:
> On Tue, 28 Apr 2015 21:41:47 -0500 Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
>
>>
>>
>> On 04/28/2015 08:36 PM, NeilBrown wrote:
>>> On Fri, 24 Apr 2015 15:30:34 +0800 gqjiang@suse.com wrote:
>>>
>>>> From: Guoqing Jiang <gqjiang@suse.com>
>>>>
>>>> For a clustered MD, create bitmaps equal to number of nodes so
>>>> each node has an independent bitmap.
>>>>
>>>> Only the first bitmap is has the bits set so that the first node
>>>> that assembles the device also performs the sync.
>>>>
>>>> The bitmaps are aligned to 4k boundaries.
>>>>
>>>> On-disk format:
>>>>
>>>> 0                    4k                     8k                    12k
>>>> -------------------------------------------------------------------
>>>> | idle                | md super            | bm super [0] + bits |
>>>> | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
>>>> | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
>>>> | bm bits [3, contd]  |                     |                     |
>>>>
>>>> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>    Create.c   |  3 ++-
>>>>    bitmap.h   |  7 +++++--
>>>>    mdadm.8.in |  7 ++++++-
>>>>    mdadm.c    | 17 ++++++++++++++++-
>>>>    super1.c   | 59 +++++++++++++++++++++++++++++++++++++++++------------------
>>>>    5 files changed, 70 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/Create.c b/Create.c
>>>> index cd5485b..9663dc4 100644
>>>> --- a/Create.c
>>>> +++ b/Create.c
>>>> @@ -752,7 +752,8 @@ int Create(struct supertype *st, char *mddev,
>>>>    #endif
>>>>    	}
>>>>
>>>> -	if (s->bitmap_file && strcmp(s->bitmap_file, "internal")==0) {
>>>> +	if (s->bitmap_file && (strcmp(s->bitmap_file, "internal")==0
>>>> +			|| strcmp(s->bitmap_file, "clustered")==0)) {
>>>>    		if ((vers%100) < 2) {
>>>>    			pr_err("internal bitmaps not supported by this kernel.\n");
>>>>    			goto abort_locked;
>>>> diff --git a/bitmap.h b/bitmap.h
>>>> index c8725a3..adbf0b4 100644
>>>> --- a/bitmap.h
>>>> +++ b/bitmap.h
>>>> @@ -154,8 +154,11 @@ typedef struct bitmap_super_s {
>>>>    	__u32 chunksize;    /* 52  the bitmap chunk size in bytes */
>>>>    	__u32 daemon_sleep; /* 56  seconds between disk flushes */
>>>>    	__u32 write_behind; /* 60  number of outstanding write-behind writes */
>>>> -
>>>> -	__u8  pad[256 - 64]; /* set to zero */
>>>> +	__u32 sectors_reserved; /* 64 number of 512-byte sectors that are
>>>> +				 * reserved for the bitmap. */
>>>> +	__u32 nodes;        /* 68 the maximum number of nodes in cluster. */
>>>> +	__u8 cluster_name[64]; /* 72 cluster name to which this md belongs */
>>>> +	__u8  pad[256 - 136]; /* set to zero */
>>>>    } bitmap_super_t;
>>>>
>>>>    /* notes:
>>>> diff --git a/mdadm.8.in b/mdadm.8.in
>>>> index a0e8288..c015cbf 100644
>>>> --- a/mdadm.8.in
>>>> +++ b/mdadm.8.in
>>>> @@ -700,7 +700,12 @@ and so is replicated on all devices.  If the word
>>>>    .B "none"
>>>>    is given with
>>>>    .B \-\-grow
>>>> -mode, then any bitmap that is present is removed.
>>>> +mode, then any bitmap that is present is removed. If the word
>>>> +.B "clustered"
>>>> +is given, the array is created for a clustered environment. One bitmap
>>>> +is created for each node as defined by the
>>>> +.B \-\-nodes
>>>> +parameter and are stored internally.
>>>>
>>>>    To help catch typing errors, the filename must contain at least one
>>>>    slash ('/') if it is a real file (not 'internal' or 'none').
>>>> diff --git a/mdadm.c b/mdadm.c
>>>> index e4f8568..6963a09 100644
>>>> --- a/mdadm.c
>>>> +++ b/mdadm.c
>>>> @@ -1111,6 +1111,15 @@ int main(int argc, char *argv[])
>>>>    				s.bitmap_file = optarg;
>>>>    				continue;
>>>>    			}
>>>> +			if (strcmp(optarg, "clustered")== 0) {
>>>> +				s.bitmap_file = optarg;
>>>> +				/* Set the default number of cluster nodes
>>>> +				 * to 4 if not already set by user
>>>> +				 */
>>>> +				if (c.nodes < 1)
>>>> +					c.nodes = 4;
>>>> +				continue;
>>>> +			}
>>>>    			/* probable typo */
>>>>    			pr_err("bitmap file must contain a '/', or be 'internal', or 'none'\n"
>>>>    				"       not '%s'\n", optarg);
>>>> @@ -1404,7 +1413,13 @@ int main(int argc, char *argv[])
>>>>    		if (c.delay == 0)
>>>>    			c.delay = DEFAULT_BITMAP_DELAY;
>>>>
>>>> -		if (!strncmp(s.bitmap_file, "internal", 9) ||
>>>> +		if (!strncmp(s.bitmap_file, "clustered", 9)) {
>>>> +			if (s.level != 1) {
>>>> +				pr_err("--bitmap=clustered is currently supported with RAID mirror only\n");
>>>> +				rv = 1;
>>>> +				break;
>>>> +			}
>>>> +		} else if (!strncmp(s.bitmap_file, "internal", 9) ||
>>>>    			!strncmp(s.bitmap_file,"none", 4)) {
>>>>    			if (c.nodes) {
>>>>    				pr_err("--nodes argument is incompatible with --bitmap=%s.\n",
>>>> diff --git a/super1.c b/super1.c
>>>> index f0508fe..ac1b011 100644
>>>> --- a/super1.c
>>>> +++ b/super1.c
>>>> @@ -2144,6 +2144,10 @@ add_internal_bitmap1(struct supertype *st,
>>>>    	bms->daemon_sleep = __cpu_to_le32(delay);
>>>>    	bms->sync_size = __cpu_to_le64(size);
>>>>    	bms->write_behind = __cpu_to_le32(write_behind);
>>>> +	bms->nodes = __cpu_to_le32(st->nodes);
>>>> +	if (st->cluster_name)
>>>> +		strncpy((char *)bms->cluster_name,
>>>> +				st->cluster_name, strlen(st->cluster_name));
>>>>
>>>>    	*chunkp = chunk;
>>>>    	return 1;
>>>> @@ -2177,6 +2181,7 @@ static int write_bitmap1(struct supertype *st, int fd)
>>>>    	void *buf;
>>>>    	int towrite, n;
>>>>    	struct align_fd afd;
>>>> +	unsigned int i;
>>>>
>>>>    	init_afd(&afd, fd);
>>>>
>>>> @@ -2185,27 +2190,45 @@ static int write_bitmap1(struct supertype *st, int fd)
>>>>    	if (posix_memalign(&buf, 4096, 4096))
>>>>    		return -ENOMEM;
>>>>
>>>> -	memset(buf, 0xff, 4096);
>>>> -	memcpy(buf, (char *)bms, sizeof(bitmap_super_t));
>>>> -
>>>> -	towrite = __le64_to_cpu(bms->sync_size) / (__le32_to_cpu(bms->chunksize)>>9);
>>>> -	towrite = (towrite+7) >> 3; /* bits to bytes */
>>>> -	towrite += sizeof(bitmap_super_t);
>>>> -	towrite = ROUND_UP(towrite, 512);
>>>> -	while (towrite > 0) {
>>>> -		n = towrite;
>>>> -		if (n > 4096)
>>>> -			n = 4096;
>>>> -		n = awrite(&afd, buf, n);
>>>> -		if (n > 0)
>>>> -			towrite -= n;
>>>> +	/* We use bms->nodes as opposed to st->nodes to
>>>> +	 * be compatible with write-after-reads such as
>>>> +	 * the GROW operation.
>>>> +	 */
>>>> +	for (i = 0; i < __le32_to_cpu(bms->nodes); i++) {
>>>> +		/* Only the first bitmap should resync
>>>> +		 * the whole device
>>>> +		 */
>>>> +		if (i)
>>>> +			memset(buf, 0x00, 4096);
>>>>    		else
>>>> +			memset(buf, 0xff, 4096);
>>>
>>> Why is the first bitmap initialised to 0x00 and the others to 0xff?
>>> If there is a good reason it should be documented either in a comment in the
>>> code or in the changelog entry.
>>
>>
>> Rather, it is the reverse. The first one is initialized to 0xff and the
>> rest are set to 0x00.
>>
>> The reason is only the first node to assemble the device should perform
>> the resync (if --assume-clean is not provided). The comment is right
>> above the code. Perhaps I should be more elaborate with the comment.
>>
>>
>
> Hmmm... Perhaps I should read code with my eyes open!
>
> Not sure I agree though.  Why should the first node be special?
> What if node '0' doesn't get activated?
> I guess it always well because of the way numbers are assigned, but I'm not
> feeling very comfortable...

Yes, the first (zero'th) one will get activated first. The  cluster 
communication will guarantee that.

In any case, we do have fallback scenarios:

- In case of failure of the first node, the "alive" node will take over
- All bitmaps are checked by the kernel while assembling. This works for 
a total cluster failure as well.


> Thinking a bit more ... why do we set any bits to '1'?

This is how the original non-clustered code is, I just moved it to the 
zeroth node :)


> Why not just set BITMAP_STALE, and let the kernel figure things out.
>
> For the single-node case, BITMAP_STALE is the same as setting all the bits to
> one.
> For the cluster case, we can get BITMAP_STALE to do whatever we want. and we
> should make sure we handle it correctly anyway.
>
> So maybe mdadm should set BITMAP_STALE, and leave all the bits as 0.
>
> Thoughts?

Should it be set for all bitmaps? If yes, how should the second node 
behave on observing that BITMAP_STALE is set while assembling? I suppose 
we can clear the flag when the (first) node is reading all bitmaps.

I suppose with --assume-clean, we just not set the BITMAP_STALE. Right?


-- 
Goldwyn

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-04-30 12:44 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-24  7:30 [PATCH 00/10] mdadm tool: add the support for cluster-md gqjiang
2015-04-24  7:30 ` [PATCH 01/10] Add nodes option while creating md gqjiang
2015-04-29  1:30   ` NeilBrown
2015-04-30  2:33     ` Guoqing Jiang
2015-04-24  7:30 ` [PATCH 02/10] home-cluster while creating an array gqjiang
2015-04-24  7:30 ` [PATCH 03/10] Create n bitmaps for clustered mode gqjiang
2015-04-29  1:36   ` NeilBrown
2015-04-29  2:41     ` Goldwyn Rodrigues
2015-04-30  2:51       ` NeilBrown
2015-04-30 12:44         ` Goldwyn Rodrigues
2015-04-29  1:41   ` NeilBrown
2015-04-30  2:44     ` Guoqing Jiang
2015-04-30  2:53       ` NeilBrown
2015-04-24  7:30 ` [PATCH 04/10] Show all bitmaps while examining bitmap gqjiang
2015-04-29  1:41   ` NeilBrown
2015-04-30  3:17     ` Guoqing Jiang
2015-04-30  4:45       ` NeilBrown
2015-04-24  7:30 ` [PATCH 05/10] Add a new clustered disk gqjiang
2015-04-29  1:45   ` NeilBrown
2015-04-30  3:20     ` Guoqing Jiang
2015-04-24  7:30 ` [PATCH 06/10] Convert a bitmap=none device to clustered gqjiang
2015-04-24  7:30 ` [PATCH 07/10] Skip clustered devices in incremental gqjiang
2015-04-24  7:30 ` [PATCH 08/10] mdadm: add the ability to change cluster name gqjiang
2015-04-29  1:50   ` NeilBrown
2015-04-30  3:22     ` Guoqing Jiang
2015-04-24  7:30 ` [PATCH 09/10] mdadm: change the num of cluster node gqjiang
2015-04-29  1:51   ` NeilBrown
2015-04-30  3:34     ` Guoqing Jiang
2015-04-30  6:47       ` NeilBrown
2015-04-30 10:04         ` Guoqing Jiang
2015-04-24  7:30 ` [PATCH 10/10] Reuse the write_bitmap for update uuid gqjiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).