Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-10-20 14:26 UTC (permalink / raw)
  To: keld; +Cc: linux-raid
In-Reply-To: <20161020140220.GB17867@www5.open-std.org>

I cannot see how badblocks program is related to this patch. It is a generic
code for bad blocks support in IMSM metadata. It introduces 64-bit value for
sector address, the same size as in kernel. All it does is syncing kernel bad
block list with raid metadata.

Tomek

On Thu, Oct 20, 2016 at 04:02:20PM +0200, keld@keldix.com wrote:
> Is this safe for 2TB+ arrays?
> I think the badblocks program does not handle 2TB+ partitions?
> 
> best regards
> Keld
> 
> Best regards
> Keld
> 
> On Thu, Oct 20, 2016 at 03:03:46PM +0200, Tomasz Majchrzak wrote:
> > If metadata handler provides support for bad blocks, tell md by writing
> > 'external_bbl' to rdev state file (both on create and assemble),
> > followed by a list of known bad blocks written via sysfs 'bad_blocks'
> > file.
> > 
> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > ---
> >  mdadm.h | 13 +++++++++++++
> >  sysfs.c | 29 ++++++++++++++++++++++++++++-
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mdadm.h b/mdadm.h
> > index 0516c82..5156ea4 100755
> > --- a/mdadm.h
> > +++ b/mdadm.h
> > @@ -237,6 +237,17 @@ struct dlm_lksb {
> >  
> >  extern const char Name[];
> >  
> > +struct md_bb_entry {
> > +	unsigned long long sector;
> > +	int length;
> > +};
> > +
> > +struct md_bb {
> > +	int supported;
> > +	int count;
> > +	struct md_bb_entry *entries;
> > +};
> > +
> >  /* general information that might be extracted from a superblock */
> >  struct mdinfo {
> >  	mdu_array_info_t	array;
> > @@ -311,6 +322,8 @@ struct mdinfo {
> >  
> >  	/* info read from sysfs */
> >  	char		sysfs_array_state[20];
> > +
> > +	struct md_bb bb;
> >  };
> >  
> >  struct createinfo {
> > diff --git a/sysfs.c b/sysfs.c
> > index d28e21a..c7a8e66 100644
> > --- a/sysfs.c
> > +++ b/sysfs.c
> > @@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
> >  		while (sra->devs) {
> >  			struct mdinfo *d = sra->devs;
> >  			sra->devs = d->next;
> > +			if (d->bb.entries)
> > +				free(d->bb.entries);
> >  			free(d);
> >  		}
> > +		if (sra->bb.entries)
> > +			free(sra->bb.entries);
> >  		free(sra);
> >  		sra = sra2;
> >  	}
> > @@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
> >  		dbase = base + strlen(base);
> >  		*dbase++ = '/';
> >  
> > -		dev = xmalloc(sizeof(*dev));
> > +		dev = xcalloc(1, sizeof(*dev));
> >  
> >  		/* Always get slot, major, minor */
> >  		strcpy(dbase, "slot");
> > @@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
> >  	char nm[PATH_MAX];
> >  	char *dname;
> >  	int rv;
> > +	int i;
> >  
> >  	sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
> >  	rv = sysfs_set_str(sra, NULL, "new_dev", dv);
> > @@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
> >  		if (resume)
> >  			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
> >  	}
> > +	if (sd->bb.supported) {
> > +		if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
> > +			/*
> > +			 * backward compatibility - if kernel doesn't support
> > +			 * bad blocks for external metadata, let it continue
> > +			 * as long as there are none known so far
> > +			 */
> > +			if (sd->bb.count) {
> > +				pr_err("The kernel has no support for bad blocks in external metadata\n");
> > +				return -1;
> > +			}
> > +		}
> > +
> > +		for (i = 0; i < sd->bb.count; i++) {
> > +			char s[30];
> > +			const struct md_bb_entry *entry = &sd->bb.entries[i];
> > +
> > +			snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
> > +				 entry->length);
> > +			rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
> > +		}
> > +	}
> >  	return rv;
> >  }
> >  
> > -- 
> > 1.8.3.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: keld @ 2016-10-20 14:02 UTC (permalink / raw)
  To: Tomasz Majchrzak
  Cc: linux-raid, Jes.Sorensen, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski
In-Reply-To: <1476968626-19233-1-git-send-email-tomasz.majchrzak@intel.com>

Hi 

Is this safe for 2TB+ arrays?
I think the badblocks program does not handle 2TB+ partitions?

best regards
Keld

Best regards
Keld

On Thu, Oct 20, 2016 at 03:03:46PM +0200, Tomasz Majchrzak wrote:
> If metadata handler provides support for bad blocks, tell md by writing
> 'external_bbl' to rdev state file (both on create and assemble),
> followed by a list of known bad blocks written via sysfs 'bad_blocks'
> file.
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  mdadm.h | 13 +++++++++++++
>  sysfs.c | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/mdadm.h b/mdadm.h
> index 0516c82..5156ea4 100755
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -237,6 +237,17 @@ struct dlm_lksb {
>  
>  extern const char Name[];
>  
> +struct md_bb_entry {
> +	unsigned long long sector;
> +	int length;
> +};
> +
> +struct md_bb {
> +	int supported;
> +	int count;
> +	struct md_bb_entry *entries;
> +};
> +
>  /* general information that might be extracted from a superblock */
>  struct mdinfo {
>  	mdu_array_info_t	array;
> @@ -311,6 +322,8 @@ struct mdinfo {
>  
>  	/* info read from sysfs */
>  	char		sysfs_array_state[20];
> +
> +	struct md_bb bb;
>  };
>  
>  struct createinfo {
> diff --git a/sysfs.c b/sysfs.c
> index d28e21a..c7a8e66 100644
> --- a/sysfs.c
> +++ b/sysfs.c
> @@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
>  		while (sra->devs) {
>  			struct mdinfo *d = sra->devs;
>  			sra->devs = d->next;
> +			if (d->bb.entries)
> +				free(d->bb.entries);
>  			free(d);
>  		}
> +		if (sra->bb.entries)
> +			free(sra->bb.entries);
>  		free(sra);
>  		sra = sra2;
>  	}
> @@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
>  		dbase = base + strlen(base);
>  		*dbase++ = '/';
>  
> -		dev = xmalloc(sizeof(*dev));
> +		dev = xcalloc(1, sizeof(*dev));
>  
>  		/* Always get slot, major, minor */
>  		strcpy(dbase, "slot");
> @@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
>  	char nm[PATH_MAX];
>  	char *dname;
>  	int rv;
> +	int i;
>  
>  	sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
>  	rv = sysfs_set_str(sra, NULL, "new_dev", dv);
> @@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
>  		if (resume)
>  			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
>  	}
> +	if (sd->bb.supported) {
> +		if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
> +			/*
> +			 * backward compatibility - if kernel doesn't support
> +			 * bad blocks for external metadata, let it continue
> +			 * as long as there are none known so far
> +			 */
> +			if (sd->bb.count) {
> +				pr_err("The kernel has no support for bad blocks in external metadata\n");
> +				return -1;
> +			}
> +		}
> +
> +		for (i = 0; i < sd->bb.count; i++) {
> +			char s[30];
> +			const struct md_bb_entry *entry = &sd->bb.entries[i];
> +
> +			snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
> +				 entry->length);
> +			rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
> +		}
> +	}
>  	return rv;
>  }
>  
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 4/4] mdmon: bad block support for external metadata - clear bad blocks
From: Tomasz Majchrzak @ 2016-10-20 13:06 UTC (permalink / raw)
  To: linux-raid
  Cc: Jes.Sorensen, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski,
	Tomasz Majchrzak

If an update of acknowledged bad blocks file is notified, read entire
bad block list from sysfs file and compare it against local list of bad
blocks. If any obsolete entries are found, remove them from metadata.

As mdmon cannot perform any memory allocation, new superswitch method
get_bad_blocks is expected to return a list of bad blocks in metadata
without allocating memory. It's up to metadata handler to allocate all
required memory in advance.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 mdadm.h   |  7 ++++++
 monitor.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index 05a2e3e..7f1a1b8 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1054,6 +1054,13 @@ extern struct superswitch {
 	int (*record_bad_block)(struct active_array *a, int n,
 					unsigned long long sector, int length);
 
+	/* clears bad block from metadata */
+	int (*clear_bad_block)(struct active_array *a, int n,
+					unsigned long long sector, int length);
+
+	/* get list of bad blocks from metadata */
+	struct md_bb *(*get_bad_blocks)(struct active_array *a, int n);
+
 	int swapuuid; /* true if uuid is bigending rather than hostendian */
 	int external;
 	const char *name; /* canonical metadata name */
diff --git a/monitor.c b/monitor.c
index 981da5b..1704a59 100644
--- a/monitor.c
+++ b/monitor.c
@@ -33,6 +33,7 @@ static char *sync_actions[] = {
 
 enum bb_action {
 	RECORD_BB = 1,
+	COMPARE_BB,
 };
 
 static int write_attr(char *attr, int fd)
@@ -184,6 +185,49 @@ int process_ubb(struct active_array *a, struct mdinfo *mdi, const unsigned long
 	return -1;
 }
 
+int compare_bb(struct active_array *a, struct mdinfo *mdi, const unsigned long
+	       long sector, const unsigned int length, void *arg)
+{
+	struct superswitch *ss = a->container->ss;
+	struct md_bb *bb = (struct md_bb *) arg;
+	int record = 1;
+	int i;
+
+	for (i = 0; i < bb->count; i++) {
+		unsigned long long start = bb->entries[i].sector;
+		unsigned long long len = bb->entries[i].length;
+
+		/*
+		 * bad block in metadata exactly matches bad block in kernel
+		 * list, just remove it from a list
+		 */
+		if ((start == sector) && (len == length)) {
+			if (i < bb->count - 1)
+				bb->entries[i] = bb->entries[bb->count - 1];
+			bb->count -= 1;
+			record = 0;
+			break;
+		}
+		/*
+		 * bad block in metadata spans bad block in kernel list,
+		 * clear it and record new bad block
+		 */
+		if ((sector >= start) && (sector + length <= start + len)) {
+			ss->clear_bad_block(a, mdi->disk.raid_disk, start, len);
+			break;
+		}
+	}
+
+	/* record all bad blocks not in metadata list */
+	if (record && (ss->record_bad_block(a, mdi->disk.raid_disk, sector,
+					     length) <= 0)) {
+		sysfs_set_str(&a->info, mdi, "state", "-external_bbl");
+		return -1;
+	}
+
+	return 1;
+}
+
 static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
 			enum bb_action action, void *arg)
 {
@@ -242,6 +286,8 @@ static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
 			if (action == RECORD_BB)
 				rc = process_ubb(a, mdi, sector, length,
 						  buf + off, consumed);
+			else if (action == COMPARE_BB)
+				rc = compare_bb(a, mdi, sector, length, arg);
 			else
 				rc = -1;
 
@@ -260,6 +306,34 @@ static int process_dev_ubb(struct active_array *a, struct mdinfo *mdi)
 	return read_bb_file(mdi->ubb_fd, a, mdi, RECORD_BB, NULL);
 }
 
+static int check_for_cleared_bb(struct active_array *a, struct mdinfo *mdi)
+{
+	struct superswitch *ss = a->container->ss;
+	struct md_bb *bb;
+	int i;
+
+	/*
+	 * Get a list of bad blocks for an array, then read list of
+	 * acknowledged bad blocks from kernel and compare it against metadata
+	 * list, clear all bad blocks remaining in metadata list
+	 */
+	bb = ss->get_bad_blocks(a, mdi->disk.raid_disk);
+	if (!bb)
+		return -1;
+
+	if (read_bb_file(mdi->bb_fd, a, mdi, COMPARE_BB, bb) < 0)
+		return -1;
+
+	for (i = 0; i < bb->count; i++) {
+		unsigned long long sector = bb->entries[i].sector;
+		int length = bb->entries[i].length;
+
+		ss->clear_bad_block(a, mdi->disk.raid_disk, sector, length);
+	}
+
+	return 0;
+}
+
 static void signal_manager(void)
 {
 	/* tgkill(getpid(), mon_tid, SIGUSR1); */
@@ -326,7 +400,7 @@ static void signal_manager(void)
 
 #define ARRAY_DIRTY 1
 #define ARRAY_BUSY 2
-static int read_and_act(struct active_array *a)
+static int read_and_act(struct active_array *a, fd_set *fds)
 {
 	unsigned long long sync_completed;
 	int check_degraded = 0;
@@ -369,6 +443,8 @@ static int read_and_act(struct active_array *a)
 			mdi->curr_state &= ~DS_FAULTY;
 			mdi->next_state |= DS_UNBLOCK;
 		}
+		if (FD_ISSET(mdi->bb_fd, fds))
+			check_for_cleared_bb(a, mdi);
 	}
 
 	gettimeofday(&tv, NULL);
@@ -755,6 +831,7 @@ static int wait_and_act(struct supertype *container, int nowait)
 		if (rv == -1) {
 			if (errno == EINTR) {
 				rv = 0;
+				FD_ZERO(&rfds);
 				dprintf("monitor: caught signal\n");
 			} else
 				dprintf("monitor: error %d in pselect\n",
@@ -796,7 +873,7 @@ static int wait_and_act(struct supertype *container, int nowait)
 			signal_manager();
 		}
 		if (a->container && !a->to_remove) {
-			int ret = read_and_act(a);
+			int ret = read_and_act(a, &rfds);
 			rv |= 1;
 			dirty_arrays += !!(ret & ARRAY_DIRTY);
 			/* when terminating stop manipulating the array after it
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 3/4] mdmon: bad block support for external metadata - store bad blocks
From: Tomasz Majchrzak @ 2016-10-20 13:05 UTC (permalink / raw)
  To: linux-raid
  Cc: Jes.Sorensen, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski,
	Tomasz Majchrzak

If md has changed the state to 'blocked' and metadata handler supports
bad blocks, try process them first. If metadata handler has successfully
stored bad block, acknowledge it to md via 'badblocks' sysfs file. If
metadata handler has failed to store the new bad block (ie. lack of
space), remove bad block support for a disk by writing "-external_bbl"
to state sysfs file. If all bad blocks have been acknowledged, do not
fail the disk but clear 'faulty' flag and request to unblock the array.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Acked-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 mdadm.h   |   4 +++
 monitor.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/mdadm.h b/mdadm.h
index 1a1c7af..05a2e3e 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -1050,6 +1050,10 @@ extern struct superswitch {
 	/* validate container after assemble */
 	int (*validate_container)(struct mdinfo *info);
 
+	/* records new bad block in metadata */
+	int (*record_bad_block)(struct active_array *a, int n,
+					unsigned long long sector, int length);
+
 	int swapuuid; /* true if uuid is bigending rather than hostendian */
 	int external;
 	const char *name; /* canonical metadata name */
diff --git a/monitor.c b/monitor.c
index 56c6500..981da5b 100644
--- a/monitor.c
+++ b/monitor.c
@@ -31,6 +31,10 @@ static char *sync_actions[] = {
 	"idle", "reshape", "resync", "recover", "check", "repair", NULL
 };
 
+enum bb_action {
+	RECORD_BB = 1,
+};
+
 static int write_attr(char *attr, int fd)
 {
 	return write(fd, attr, strlen(attr));
@@ -158,6 +162,104 @@ int read_dev_state(int fd)
 	return rv;
 }
 
+int process_ubb(struct active_array *a, struct mdinfo *mdi, const unsigned long
+		long sector, const int length, const char *buf,
+		const int buf_len)
+{
+	struct superswitch *ss = a->container->ss;
+
+	/*
+	 * record bad block in metadata first, then acknowledge it to the driver
+	 * via sysfs file
+	 */
+	if ((ss->record_bad_block(a, mdi->disk.raid_disk, sector, length)) &&
+	    (write(mdi->bb_fd, buf, buf_len) == buf_len))
+		return 1;
+
+	/*
+	 * failed to store or acknowledge bad block, switch of bad block support
+	 * to get it out of blocked state
+	 */
+	sysfs_set_str(&a->info, mdi, "state", "-external_bbl");
+	return -1;
+}
+
+static int read_bb_file(int fd, struct active_array *a, struct mdinfo *mdi,
+			enum bb_action action, void *arg)
+{
+	char buf[30];
+	int n = 0;
+	int ret = 0;
+	int read_again = 0;
+	int off = 0;
+	int pos = 0;
+	int preserve_pos = (action == RECORD_BB ? 0 : 1);
+
+	if (lseek(fd, 0, SEEK_SET) == (off_t) -1)
+		return -1;
+
+	do {
+		read_again = 0;
+		n = read(fd, buf + pos, sizeof(buf) - 1 - pos);
+		if (n < 0)
+			return -1;
+		n += pos;
+
+		buf[n] = '\0';
+		off = 0;
+
+		while (off < n) {
+			unsigned long long sector;
+			int length;
+			char newline;
+			int consumed;
+			int matched;
+			int rc;
+
+			/* kernel sysfs file format: "sector length\n" */
+			matched = sscanf(buf + off, "%llu %d%c%n", &sector,
+					 &length, &newline, &consumed);
+			if ((matched != 3) && (off > 0)) {
+				/* truncated entry, read again */
+				if (preserve_pos) {
+					pos = sizeof(buf) - off - 1;
+					memmove(buf, buf + off, pos);
+				} else {
+					if (lseek(fd, 0, SEEK_SET) ==
+					    (off_t) -1)
+						return -1;
+				}
+				read_again = 1;
+				break;
+			}
+			if (matched != 3)
+				return -1;
+			if (newline != '\n')
+				return -1;
+			if (length <= 0)
+				return -1;
+
+			if (action == RECORD_BB)
+				rc = process_ubb(a, mdi, sector, length,
+						  buf + off, consumed);
+			else
+				rc = -1;
+
+			if (rc < 0)
+				return rc;
+			ret += rc;
+			off += consumed;
+		}
+	} while (read_again);
+
+	return ret;
+}
+
+static int process_dev_ubb(struct active_array *a, struct mdinfo *mdi)
+{
+	return read_bb_file(mdi->ubb_fd, a, mdi, RECORD_BB, NULL);
+}
+
 static void signal_manager(void)
 {
 	/* tgkill(getpid(), mon_tid, SIGUSR1); */
@@ -256,6 +358,17 @@ static int read_and_act(struct active_array *a)
 					  &mdi->recovery_start);
 			mdi->curr_state = read_dev_state(mdi->state_fd);
 		}
+		/*
+		 * If array is blocked and metadata handler is able to handle
+		 * BB, check if you can acknowledge them to md driver. If
+		 * successful, clear faulty state and unblock the array.
+		 */
+		if ((mdi->curr_state & DS_BLOCKED) &&
+		    a->container->ss->record_bad_block &&
+		    (process_dev_ubb(a, mdi) > 0)) {
+			mdi->curr_state &= ~DS_FAULTY;
+			mdi->next_state |= DS_UNBLOCK;
+		}
 	}
 
 	gettimeofday(&tv, NULL);
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 2/4] mdmon: bad block support for external metadata - sysfs file open
From: Tomasz Majchrzak @ 2016-10-20 13:04 UTC (permalink / raw)
  To: linux-raid
  Cc: Jes.Sorensen, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski,
	Tomasz Majchrzak

Open 'badblocks' and 'unacknowledged_bad_blocks' sysfs files for each
disk in the array. Add them to the list of files observed by monitor.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 managemon.c | 17 +++++++++++++++++
 mdadm.h     |  2 ++
 monitor.c   |  7 ++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/managemon.c b/managemon.c
index 6d1b3d8..3c1d4cb 100644
--- a/managemon.c
+++ b/managemon.c
@@ -115,6 +115,8 @@ static void close_aa(struct active_array *aa)
 	for (d = aa->info.devs; d; d = d->next) {
 		close(d->recovery_fd);
 		close(d->state_fd);
+		close(d->bb_fd);
+		close(d->ubb_fd);
 	}
 
 	if (aa->action_fd >= 0)
@@ -433,6 +435,21 @@ static int disk_init_and_add(struct mdinfo *disk, struct mdinfo *clone,
 		close(disk->recovery_fd);
 		return -1;
 	}
+	disk->bb_fd = sysfs_open2(aa->info.sys_name, disk->sys_name,
+				 "bad_blocks");
+	if (disk->bb_fd < 0) {
+		close(disk->recovery_fd);
+		close(disk->state_fd);
+		return -1;
+	}
+	disk->ubb_fd = sysfs_open2(aa->info.sys_name, disk->sys_name,
+				  "unacknowledged_bad_blocks");
+	if (disk->ubb_fd < 0) {
+		close(disk->recovery_fd);
+		close(disk->state_fd);
+		close(disk->bb_fd);
+		return -1;
+	}
 	disk->prev_state = read_dev_state(disk->state_fd);
 	disk->curr_state = disk->prev_state;
 	disk->next = aa->info.devs;
diff --git a/mdadm.h b/mdadm.h
index 5156ea4..1a1c7af 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -311,6 +311,8 @@ struct mdinfo {
 	/* Device info for mdmon: */
 	int recovery_fd;
 	int state_fd;
+	int bb_fd;
+	int ubb_fd;
 	#define DS_FAULTY	1
 	#define	DS_INSYNC	2
 	#define	DS_WRITE_MOSTLY	4
diff --git a/monitor.c b/monitor.c
index 4c79ce2..56c6500 100644
--- a/monitor.c
+++ b/monitor.c
@@ -454,6 +454,8 @@ static int read_and_act(struct active_array *a)
 				dprintf_cont(" %d:removed", mdi->disk.raid_disk);
 				close(mdi->state_fd);
 				close(mdi->recovery_fd);
+				close(mdi->bb_fd);
+				close(mdi->ubb_fd);
 				mdi->state_fd = -1;
 			} else
 				ret |= ARRAY_BUSY;
@@ -583,8 +585,11 @@ static int wait_and_act(struct supertype *container, int nowait)
 		add_fd(&rfds, &maxfd, a->info.state_fd);
 		add_fd(&rfds, &maxfd, a->action_fd);
 		add_fd(&rfds, &maxfd, a->sync_completed_fd);
-		for (mdi = a->info.devs ; mdi ; mdi = mdi->next)
+		for (mdi = a->info.devs ; mdi ; mdi = mdi->next) {
 			add_fd(&rfds, &maxfd, mdi->state_fd);
+			add_fd(&rfds, &maxfd, mdi->bb_fd);
+			add_fd(&rfds, &maxfd, mdi->ubb_fd);
+		}
 
 		ap = &(*ap)->next;
 	}
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 1/4] mdadm: bad block support for external metadata - initialization
From: Tomasz Majchrzak @ 2016-10-20 13:03 UTC (permalink / raw)
  To: linux-raid
  Cc: Jes.Sorensen, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski,
	Tomasz Majchrzak

If metadata handler provides support for bad blocks, tell md by writing
'external_bbl' to rdev state file (both on create and assemble),
followed by a list of known bad blocks written via sysfs 'bad_blocks'
file.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 mdadm.h | 13 +++++++++++++
 sysfs.c | 29 ++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/mdadm.h b/mdadm.h
index 0516c82..5156ea4 100755
--- a/mdadm.h
+++ b/mdadm.h
@@ -237,6 +237,17 @@ struct dlm_lksb {
 
 extern const char Name[];
 
+struct md_bb_entry {
+	unsigned long long sector;
+	int length;
+};
+
+struct md_bb {
+	int supported;
+	int count;
+	struct md_bb_entry *entries;
+};
+
 /* general information that might be extracted from a superblock */
 struct mdinfo {
 	mdu_array_info_t	array;
@@ -311,6 +322,8 @@ struct mdinfo {
 
 	/* info read from sysfs */
 	char		sysfs_array_state[20];
+
+	struct md_bb bb;
 };
 
 struct createinfo {
diff --git a/sysfs.c b/sysfs.c
index d28e21a..c7a8e66 100644
--- a/sysfs.c
+++ b/sysfs.c
@@ -50,8 +50,12 @@ void sysfs_free(struct mdinfo *sra)
 		while (sra->devs) {
 			struct mdinfo *d = sra->devs;
 			sra->devs = d->next;
+			if (d->bb.entries)
+				free(d->bb.entries);
 			free(d);
 		}
+		if (sra->bb.entries)
+			free(sra->bb.entries);
 		free(sra);
 		sra = sra2;
 	}
@@ -259,7 +263,7 @@ struct mdinfo *sysfs_read(int fd, char *devnm, unsigned long options)
 		dbase = base + strlen(base);
 		*dbase++ = '/';
 
-		dev = xmalloc(sizeof(*dev));
+		dev = xcalloc(1, sizeof(*dev));
 
 		/* Always get slot, major, minor */
 		strcpy(dbase, "slot");
@@ -687,6 +691,7 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 	char nm[PATH_MAX];
 	char *dname;
 	int rv;
+	int i;
 
 	sprintf(dv, "%d:%d", sd->disk.major, sd->disk.minor);
 	rv = sysfs_set_str(sra, NULL, "new_dev", dv);
@@ -718,6 +723,28 @@ int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume)
 		if (resume)
 			sysfs_set_num(sra, sd, "recovery_start", sd->recovery_start);
 	}
+	if (sd->bb.supported) {
+		if (sysfs_set_str(sra, sd, "state", "external_bbl")) {
+			/*
+			 * backward compatibility - if kernel doesn't support
+			 * bad blocks for external metadata, let it continue
+			 * as long as there are none known so far
+			 */
+			if (sd->bb.count) {
+				pr_err("The kernel has no support for bad blocks in external metadata\n");
+				return -1;
+			}
+		}
+
+		for (i = 0; i < sd->bb.count; i++) {
+			char s[30];
+			const struct md_bb_entry *entry = &sd->bb.entries[i];
+
+			snprintf(s, sizeof(s) - 1, "%llu %d\n", entry->sector,
+				 entry->length);
+			rv |= sysfs_set_str(sra, sd, "bad_blocks", s);
+		}
+	}
 	return rv;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: MD-RAID: Use seq_putc() in three status functions?
From: SF Markus Elfring @ 2016-10-20 12:24 UTC (permalink / raw)
  To: Hannes Reinecke, linux-raid
  Cc: Bernd Petrovitsch, Christoph Hellwig, Guoqing Jiang, Jens Axboe,
	Joe Perches, Mike Christie, Neil Brown, Shaohua Li,
	Tomasz Majchrzak, LKML, kernel-janitors, kbuild-all, ltp
In-Reply-To: <cc24736e-2227-2a49-e131-386775c942d9@users.sourceforge.net>

>> So back to the original task for you: Show me in the generated output where the benefits are.

I can offer another bit of information for this software development discussion.

The following build settings were active in my "Makefile" for this Linux test case.

…
HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O0 -fomit-frame-pointer -std=gnu89
…

The afffected source files can be compiled for the processor architecture "x86_64"
by a tool like "GCC 6.2.1+r239849-1.4" from the software distribution
"openSUSE Tumbleweed" with the following command example.

my_original=${my_build_dir}unchanged/test/ \
&& my_fixing=${my_build_dir}patched/test/ \
&& mkdir -p ${my_original} ${my_fixing} \
&& my_cc=/usr/bin/gcc-6 \
&& my_module=drivers/md/raid1.s \
&& git checkout next-20161014 \
&& make -j6 O="${my_original}" HOSTCC="${my_cc}" allmodconfig ${my_module} \
&& git checkout next_usage_of_seq_putc_in_md_raid_1 \
&& make -j6 O="${my_fixing}" HOSTCC="${my_cc}" allmodconfig ${my_module} \
&& diff -u "${my_original}${my_module}" "${my_fixing}${my_module}" > "${my_build_dir}assembler_code_comparison_$(date -I)_1.diff"

Unfortunately, the generated file got the size "311 KiB". I guess that
this is too big to send such a file around on the Linux mailing list.

Is this kind of assembler code comparison still useful to clarify relevant
differences further?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged
From: Tomasz Majchrzak @ 2016-10-20 12:09 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20161020052818.GA17974@kernel.org>

On Wed, Oct 19, 2016 at 10:28:18PM -0700, Shaohua Li wrote:
> On Tue, Oct 18, 2016 at 04:10:24PM +0200, Tomasz Majchrzak wrote:
> > Once external metadata handler acknowledges all bad blocks (by writing
> > to rdev 'bad_blocks' sysfs file), it requests to unblock the array.
> > Check if all bad blocks are actually acknowledged as there might be a
> > race if new bad blocks are notified at the same time. If all bad blocks
> > are acknowledged, just unblock the array and continue. If not, ignore
> > the request to unblock (do not fail an array). External metadata handler
> > is expected to either process remaining bad blocks and try to unblock
> > again or remove bad block support for a disk (which will cause disk to
> > fail as in no-support case).
> > 
> > Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> > Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > ---
> >  drivers/md/md.c | 24 +++++++++++++++++-------
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index cc05236..ce585b7 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> >  		set_bit(Blocked, &rdev->flags);
> >  		err = 0;
> >  	} else if (cmd_match(buf, "-blocked")) {
> > -		if (!test_bit(Faulty, &rdev->flags) &&
> > +		int unblock = 1;
> > +		int acked = !rdev->badblocks.unacked_exist;
> > +
> > +		if ((test_bit(ExternalBbl, &rdev->flags) &&
> > +		     rdev->badblocks.changed))
> > +			acked = check_if_badblocks_acked(&rdev->badblocks);
> > +
> > +		if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
> > +			unblock = 0;
> > +		} else if (!test_bit(Faulty, &rdev->flags) &&
> 
> I missed one thing in last review. writing to bad_blocks sysfs file already
> clears the BlockedBadBlocks bit and wakeup the thread sleeping at blocked_wait,
> so the array can continue. Why do we need to fix state_store here?

We cannot unblock the rdev until all bad blocks are acknowledged. The problem is
mdadm cannot be sure it has stored all bad blocks in the first pass (read of
unacknowledged_bad_blocks sysfs file). When bad block is encountered, rdev
enters Blocked, Faulty state (unacked_exists is non-zero in state_show). Then
mdadm reads the bad block, stores it in metadata and acknowledges it to the
kernel. Initially I have tried to call ack_all_badblocks in bb_store or in
state_store("-blocked") but there was a race. If other requests (the ones that
had started before array got into blocked state) notified bad blocks after sysfs
file was read by mdadm but before ack_all_badblocks call, ack_all_badblocks call
was also acknowledging bad blocks not stored (and never to be as a result) in
metadata. That's why I have introduced a new function
check_if_all_badblocks_acked to close this race.

I'm not sure if native bad block support is not prone to the similar problem -
bad block structure modified between metadata sync and ack_all_badblocks call.

As for BlockedBadBlocks flag cleared in bb_store, commit de393cdea66c ("md: make
it easier to wait for bad blocks to be acknowledged") explains this flag is only
an advisory. All awaiting requests are woken up and check if bad block that
interests them is already acknowledged. If so, then can continue, and if not,
they set the flag again to check in a while. It is just a useful optimization.

Please note that rdev with unacknowledged bad block is reported as Blocked via
sysfs state (non-zero unacked_exists), even though the corresponding rdev kernel
flag is not set. It is the reason why mdadm calls state_store("-blocked").

I hope it clarifies all your doubts.

Tomek

^ permalink raw reply

* Re: [PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged
From: Shaohua Li @ 2016-10-20  5:28 UTC (permalink / raw)
  To: Tomasz Majchrzak
  Cc: linux-raid, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt, mariusz.dabrowski, axboe,
	dan.j.williams, linux-block
In-Reply-To: <1476799824-6498-1-git-send-email-tomasz.majchrzak@intel.com>

On Tue, Oct 18, 2016 at 04:10:24PM +0200, Tomasz Majchrzak wrote:
> Once external metadata handler acknowledges all bad blocks (by writing
> to rdev 'bad_blocks' sysfs file), it requests to unblock the array.
> Check if all bad blocks are actually acknowledged as there might be a
> race if new bad blocks are notified at the same time. If all bad blocks
> are acknowledged, just unblock the array and continue. If not, ignore
> the request to unblock (do not fail an array). External metadata handler
> is expected to either process remaining bad blocks and try to unblock
> again or remove bad block support for a disk (which will cause disk to
> fail as in no-support case).
> 
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/md.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index cc05236..ce585b7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>  		set_bit(Blocked, &rdev->flags);
>  		err = 0;
>  	} else if (cmd_match(buf, "-blocked")) {
> -		if (!test_bit(Faulty, &rdev->flags) &&
> +		int unblock = 1;
> +		int acked = !rdev->badblocks.unacked_exist;
> +
> +		if ((test_bit(ExternalBbl, &rdev->flags) &&
> +		     rdev->badblocks.changed))
> +			acked = check_if_badblocks_acked(&rdev->badblocks);
> +
> +		if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
> +			unblock = 0;
> +		} else if (!test_bit(Faulty, &rdev->flags) &&

I missed one thing in last review. writing to bad_blocks sysfs file already
clears the BlockedBadBlocks bit and wakeup the thread sleeping at blocked_wait,
so the array can continue. Why do we need to fix state_store here?

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
From: NeilBrown @ 2016-10-20  3:03 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-8-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 3060 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> This is the recovery part of raid5-cache.
>
> With cache feature, there are 2 different scenarios of recovery:
> 1. Data-Parity stripe: a stripe with complete parity in journal.
> 2. Data-Only stripe: a stripe with only data in journal (or partial
>    parity).
>
> The code differentiate Data-Parity stripe from Data-Only stripe with
> flag (STRIPE_R5C_WRITTEN).
>
> For Data-Parity stripes, we use the same procedure as raid5 journal,
> where all the data and parity are replayed to the RAID devices.
>
> For Data-Only strips, we need to finish complete calculate parity and
> finish the full reconstruct write or RMW write. For simplicity, in
> the recovery, we load the stripe to stripe cache. Once the array is
> started, the stripe cache state machine will handle these stripes
> through normal write path.
>
> r5c_recovery_flush_log contains the main procedure of recovery. The
> recovery code first scans through the journal and loads data to
> stripe cache. The code keeps tracks of all these stripes in a list
> (use sh->lru and ctx->cached_list), stripes in the list are
> organized in the order of its first appearance on the journal.
> During the scan, the recovery code assesses each stripe as
> Data-Parity or Data-Only.
>
> During scan, the array may run out of stripe cache. In these cases,
> the recovery code will also call raid5_set_cache_size to increase
> stripe cache size.
>
> At the end of scan, the recovery code replays all Data-Parity
> stripes, and sets proper states for Data-Only stripes. The recovery
> code also increases seq number by 10 and rewrites all Data-Only
> stripes to journal. This is to avoid confusion after repeated
> crashes. More details is explained in raid5-cache.c before
> r5c_recovery_rewrite_data_only_stripes().
>
> Signed-off-by: Song Liu <songliubraving@fb.com>


This patch seems to do a number of different things.
I think it re-factors the journal reading code.
It adds code to write a new "empty" journal metadata block
and it adds support for recovery of cached data.

All this together makes it hard to review.  I'd rather more smaller
patches.


> +	/* stripes only have parity are already flushed to RAID */
> +	if (data_count == 0)
> +		goto out;

Can you explain why that is?  When were they flushed to the RAID, and
how was the parity determined?


> +
> +static void
> +r5l_recovery_create_emtpy_meta_block(struct r5l_log *log,

"empty"

> +				     struct page *page,
> +				     sector_t pos, u64 seq)

> +/* returns 0 for match; 1 for mismtach */

No, please don't do that.
You can return an negative error if you like, and call it as
   function_name() < 0
 or
   function_name() == 0

or give the function a name that describes what it tests
i.e.
   r5l_data_checksum_is_correct()
or
   r5l_data_checksum_does_not_match()

and then return 0 if the test fails and 1 if it succeeds.

> +static int
> +r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
> +				  sector_t log_offset, __le32 log_checksum)



Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v2] Fix bus error when accessing MBR partition records
From: Jes Sorensen @ 2016-10-19 16:33 UTC (permalink / raw)
  To: James Clarke; +Cc: linux-raid
In-Reply-To: <20161017201601.93258-1-jrtc27@jrtc27.com>

James Clarke <jrtc27@jrtc27.com> writes:
> Since the MBR layout only has partition records as 2-byte aligned, the 32-bit
> fields in them are not aligned. Thus, they cannot be accessed on some
> architectures (such as SPARC) by using a "struct MBR_part_record *" pointer,
> as the compiler can assume that the pointer is properly aligned. Instead, the
> records must be accessed by going through the MBR struct itself every time.
>
> Signed-off-by: James Clarke <jrtc27@jrtc27.com>
> ---
> Changes from v1:
>
>  * Added packed attribute to MBR_part_record
>
>  * Reformatted changes to fit the 80-character line limit (assuming a tab stop
>    of 4; if it's 8 I can try and cut the line lengths down, though with the
>    extra verbosity that's going to be more awkward...)

I am going to clean this up - a tab is always 8 characters just like
with the kernel.

Jes

>
>  part.h      |  2 +-
>  super-mbr.c |  6 ++++++
>  util.c      | 15 ++++++++-------
>  3 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/part.h b/part.h
> index 862a14c..e697fb4 100644
> --- a/part.h
> +++ b/part.h
> @@ -40,7 +40,7 @@ struct MBR_part_record {
>    __u8 last_cyl;
>    __u32 first_sect_lba;
>    __u32 blocks_num;
> -};
> +} __attribute__((packed));
>  
>  struct MBR {
>  	__u8 pad[446];
> diff --git a/super-mbr.c b/super-mbr.c
> index 62b3f03..303dde4 100644
> --- a/super-mbr.c
> +++ b/super-mbr.c
> @@ -57,6 +57,9 @@ static void examine_mbr(struct supertype *st, char *homehost)
>  
>  	printf("   MBR Magic : %04x\n", sb->magic);
>  	for (i = 0; i < MBR_PARTITIONS; i++)
> +		/* Have to make every access through sb rather than using a pointer to
> +		 * the partition table (or an entry), since the entries are not
> +		 * properly aligned. */
>  		if (sb->parts[i].blocks_num)
>  			printf("Partition[%d] : %12lu sectors at %12lu (type %02x)\n",
>  			       i,
> @@ -151,6 +154,9 @@ static void getinfo_mbr(struct supertype *st, struct mdinfo *info, char *map)
>  	info->component_size = 0;
>  
>  	for (i = 0; i < MBR_PARTITIONS ; i++)
> +		/* Have to make every access through sb rather than using a pointer to
> +		 * the partition table (or an entry), since the entries are not
> +		 * properly aligned. */
>  		if (sb->parts[i].blocks_num) {
>  			unsigned long last =
>  				(unsigned long)__le32_to_cpu(sb->parts[i].blocks_num)
> diff --git a/util.c b/util.c
> index a238a21..4a8767b 100644
> --- a/util.c
> +++ b/util.c
> @@ -1412,7 +1412,6 @@ static int get_gpt_last_partition_end(int fd, unsigned long long *endofpart)
>  static int get_last_partition_end(int fd, unsigned long long *endofpart)
>  {
>  	struct MBR boot_sect;
> -	struct MBR_part_record *part;
>  	unsigned long long curr_part_end;
>  	unsigned part_nr;
>  	int retval = 0;
> @@ -1429,21 +1428,23 @@ static int get_last_partition_end(int fd, unsigned long long *endofpart)
>  	if (boot_sect.magic == MBR_SIGNATURE_MAGIC) {
>  		retval = 1;
>  		/* found the correct signature */
> -		part = boot_sect.parts;
>  
>  		for (part_nr = 0; part_nr < MBR_PARTITIONS; part_nr++) {
> +			/* Have to make every access through boot_sect rather than using a
> +			 * pointer to the partition table (or an entry), since the entries
> +			 * are not properly aligned. */
> +
>  			/* check for GPT type */
> -			if (part->part_type == MBR_GPT_PARTITION_TYPE) {
> +			if (boot_sect.parts[part_nr].part_type == MBR_GPT_PARTITION_TYPE) {
>  				retval = get_gpt_last_partition_end(fd, endofpart);
>  				break;
>  			}
>  			/* check the last used lba for the current partition  */
> -			curr_part_end = __le32_to_cpu(part->first_sect_lba) +
> -				__le32_to_cpu(part->blocks_num);
> +			curr_part_end =
> +				__le32_to_cpu(boot_sect.parts[part_nr].first_sect_lba) +
> +				__le32_to_cpu(boot_sect.parts[part_nr].blocks_num);
>  			if (curr_part_end > *endofpart)
>  				*endofpart = curr_part_end;
> -
> -			part++;
>  		}
>  	} else {
>  		/* Unknown partition table */

^ permalink raw reply

* Re: [PATCH] Allow level migration only for single-array container
From: Jes Sorensen @ 2016-10-19 15:26 UTC (permalink / raw)
  To: Mariusz Dabrowski
  Cc: linux-raid, tomasz.majchrzak, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt
In-Reply-To: <1476275382-1566-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> IMSM doesn't allow to change RAID level of array in container with two
> arrays but array count check is being done too late (after removing disks)
> and in some cases (e. g. RAID 0 and RAID 1 migrated to RAID 0) both arrays
> become degraded. This patch adds array count check before disks are being
> removed.
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  Grow.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/Grow.c b/Grow.c
> index 628f0e7..bcd27f5 100755
> --- a/Grow.c
> +++ b/Grow.c
> @@ -777,6 +777,25 @@ int remove_disks_for_takeover(struct supertype *st,
>  	struct mdinfo *remaining;
>  	int slot;
>  
> +	if (st->ss->external) {
> +		int rv = 0;
> +		struct mdinfo *arrays = st->ss->container_content(st, NULL);
> +		/* containter_content returns list of arrays in container
> +		 * If arrays->next is not NULL it means that there are 2 arrays in
> +		 * container and operation should be blocked
> +		 */
> +		if (arrays) {
> +			if (arrays->next)
> +				rv = 1;
> +			sysfs_free(arrays);
> +			if (rv) {
> +				pr_err("Error. Cannot perform operation on /dev/%s\n", st->devnm);
> +				pr_err("For this operation it MUST be single array in container\n");
> +				return rv;
> +			}
> +		}
> +	}
> +

Applied, but do remember code is 80 characters wide - except for print
statements. I will fixup that comment line as part of applying it.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] imsm: block chunk size change for RAID 10
From: Jes Sorensen @ 2016-10-19 15:23 UTC (permalink / raw)
  To: Mariusz Dabrowski
  Cc: linux-raid, tomasz.majchrzak, aleksey.obitotskiy, pawel.baldysiak,
	artur.paszkiewicz, maksymilian.kunt
In-Reply-To: <1476275322-1251-1-git-send-email-mariusz.dabrowski@intel.com>

Mariusz Dabrowski <mariusz.dabrowski@intel.com> writes:
> Chunk size change of RAID 10 array fails because it is not supported but
> invalid values still are being written to metadata and array cannot be
> assembled after stop. Operation should be blocked before metadata update.
>
> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com>
> ---
>  super-intel.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/super-intel.c b/super-intel.c
> index 92817e9..0b3b2b1 100644
> --- a/super-intel.c
> +++ b/super-intel.c
> @@ -10074,10 +10074,16 @@ enum imsm_reshape_type imsm_analyze_change(struct supertype *st,
>  	}
>  
>  	if ((geo->chunksize > 0) && (geo->chunksize != UnSet)
> -	    && (geo->chunksize != info.array.chunk_size))
> +	    && (geo->chunksize != info.array.chunk_size)) {
> +		if (info.array.level == 10) {
> +			pr_err("Error. Chunk size change for RAID 10 is not supported.\n");
> +			change = -1;
> +			goto analyse_change_exit;
> +		}
>  		change = CH_MIGRATION;
> -	else
> +	} else {
>  		geo->chunksize = info.array.chunk_size;
> +	}
>  
>  	chunk = geo->chunksize / 1024;

Applied, however I am going to go in and fix up the broken formatting of
that if statement as well.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] super1: make write_bitmap1 compatible with previous mdadm versions
From: Jes Sorensen @ 2016-10-19 15:21 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, Neil Brown
In-Reply-To: <1476253447-19865-1-git-send-email-gqjiang@suse.com>

Guoqing Jiang <gqjiang@suse.com> writes:
> For older mdadm version, v1.x metadata has different bitmap_offset,
> we can't ensure all the bitmaps are on a 4K boundary since writing
> 4K for bitmap could corrupt the superblock, and Anthony reported
> the bug about it at below link.
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=837964
>
> So let's check about the alignment for bitmap_offset before set
> the boundary to 4096 unconditionally. Thanks for Neil's detailed
> explanation.
>
> Reported-by: Anthony DeRobertis <anthony@derobert.net>
> Fixes: 95a05b37e8eb ("Create n bitmaps for clustered mode")
> Cc: Neil Brown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  super1.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH] badblocks: fix overlapping check for clearing
From: NeilBrown @ 2016-10-19  2:36 UTC (permalink / raw)
  To: Tomasz Majchrzak, Dan Williams, linux-block; +Cc: linux-raid
In-Reply-To: <20161012102621.GA30839@proton.igk.intel.com>

[-- Attachment #1: Type: text/plain, Size: 3178 bytes --]

On Wed, Oct 12 2016, Tomasz Majchrzak wrote:

> On Mon, Oct 10, 2016 at 03:32:58PM -0700, Dan Williams wrote:
>> > On Tue, Sep 06 2016, Tomasz Majchrzak wrote:
>> >> ---
>> >>  block/badblocks.c | 6 ++++--
>> >>  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/block/badblocks.c b/block/badblocks.c
>> >> index 7be53cb..b2ffcc7 100644
>> >> --- a/block/badblocks.c
>> >> +++ b/block/badblocks.c
>> >> @@ -354,7 +354,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
>> >>                * current range.  Earlier ranges could also overlap,
>> >>                * but only this one can overlap the end of the range.
>> >>                */
>> >> -             if (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) {
>> >> +             if ((BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > target) &&
>> >> +                 (BB_OFFSET(p[lo]) <= target)) {
>> >
>> > hmmm..
>> > 'target' is the sector just beyond the set of sectors to remove from the
>> > list.
>> > BB_OFFSET(p[lo]) is the first sector in a range that was found in the
>> > list.
>> > If these are equal, then are aren't clearing anything in this range.
>> > So I would have '<', not '<='.
>> >
>> > I don't think this makes the code wrong as we end up assigning to p[lo]
>> > the value that is already there.  But it might be confusing.
>> >
>> >
>> >>                       /* Partial overlap, leave the tail of this range */
>> >>                       int ack = BB_ACK(p[lo]);
>> >>                       sector_t a = BB_OFFSET(p[lo]);
>> >> @@ -377,7 +378,8 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
>> >>                       lo--;
>> >>               }
>> >>               while (lo >= 0 &&
>> >> -                    BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
>> >> +                    (BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) &&
>> >> +                    (BB_OFFSET(p[lo]) <= target)) {
>> >
>> > Ditto.
>> >
>> > But the code is, I think, correct. Just not how I would have written it.
>> > So
>> >
>> >  Acked-by: NeilBrown <neilb@suse.com>
>> 
>> I agree with the comments to change "<=" to "<".  Tomasz, care to
>> re-send with those changes?
>
> I have just resent the patch with your suggestions included.
>
>> > In the original md context, it would only ever be called on a block that
>> > was already in the list.
>
> Actually MD RAID10 calls it this way. See handle_write_completed, it iterates
> over all copies and clears the bad block if error has not been returned. I have
> a test case which fails for that reason - existing bad block is modified by
> clear block. It is very unlikely to happen in real life as it depends on
> specific layout of bad blocks and their discovery order, however it's a gap that
> needs to be closed.

Ahh, I didn't realize that.  I see that you are correct though.

>
> I had put some effort to see if clearing of non-existing bad block in RAID10 can
> lead to some incorrect behaviour but I haven't found any. It seems that my patch
> is sufficient to fix the problem.

Yes.  Thanks for a lot for sorting this out :-)

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
From: NeilBrown @ 2016-10-19  2:32 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-8-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 1748 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> This is the recovery part of raid5-cache.
>
> With cache feature, there are 2 different scenarios of recovery:
> 1. Data-Parity stripe: a stripe with complete parity in journal.
> 2. Data-Only stripe: a stripe with only data in journal (or partial
>    parity).
>
> The code differentiate Data-Parity stripe from Data-Only stripe with
> flag (STRIPE_R5C_WRITTEN).
>
> For Data-Parity stripes, we use the same procedure as raid5 journal,
> where all the data and parity are replayed to the RAID devices.
>
> For Data-Only strips, we need to finish complete calculate parity and
> finish the full reconstruct write or RMW write. For simplicity, in
> the recovery, we load the stripe to stripe cache. Once the array is
> started, the stripe cache state machine will handle these stripes
> through normal write path.
>
> r5c_recovery_flush_log contains the main procedure of recovery. The
> recovery code first scans through the journal and loads data to
> stripe cache. The code keeps tracks of all these stripes in a list
> (use sh->lru and ctx->cached_list), stripes in the list are
> organized in the order of its first appearance on the journal.
> During the scan, the recovery code assesses each stripe as
> Data-Parity or Data-Only.
>
> During scan, the array may run out of stripe cache. In these cases,
> the recovery code will also call raid5_set_cache_size to increase
> stripe cache size.

What if this fails.  Maybe the array was created on a machine with lots
or memory, but that machine died and you are trying to recovery you data
on a much less capable machine.

I don't think there is an easy answer, but at least you need to fail
gracefully.

I'll try to look at the rest tomorrow.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state
From: NeilBrown @ 2016-10-19  2:19 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-7-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> r5c_state have 4 states:
> * no-cache;
> * write-through (write journal only);
> * write-back (w/ write cache);
> * cache-broken (journal missing or Faulty)

I don't like this.

We already have a mechanism for removing a failed device from
an array: write 'remove' to the 'state' file in the relevant dev-*
subdirectory.
You can also use that file to tell if the journal has failed.

Please call the file something a little more obvious that 'r5c_state',
maybe 'journal_mode', and use it only to switch between write-through
and write-back.
If there is no journal, then either remove the file, or have writes fail
and reads return something obvious ... maybe ENODEV.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v5 5/8] md/r5cache: reclaim support
From: NeilBrown @ 2016-10-19  2:03 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu
In-Reply-To: <20161013054944.1038806-6-songliubraving@fb.com>

[-- Attachment #1: Type: text/plain, Size: 7199 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> There are two limited resources, stripe cache and journal disk space.
> For better performance, we priotize reclaim of full stripe writes.
> To free up more journal space, we free earliest data on the journal.
>
> In current implementation, reclaim happens when:
> 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds) reclaim
>    if there is no reclaim in the past 5 seconds.

5 seconds is an arbitrary number.  I don't think it needs to be made
configurable, but it might help to justify it.
I would probably make it closer to 30 seconds.  There really isn't any
rush.  If there is much load, it will never wait this long.  If there is
not much load ... then it probably doesn't matter.


> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>    (r5c_check_cached_full_stripe)

This is another arbitrary number, but I think it needs more
justification.  Why 32?  That's 128K per device, which isn't very much.
It might make sense to set the batch size based on the stripe size
(e.g. at least on stripe?) or on the cache size.

> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>
> r5c_do_reclaim() contains new logic of reclaim.
>
> For stripe cache:
>
> When stripe cache pressure is high (more than 3/4 stripes are cached,

You say "3/4" here bit I think the code says "1/2"
> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
But there is also
> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)

so maybe you do mean 3/4..  Maybe some comments closer to the different
fractions would help.

> @@ -141,6 +150,12 @@ struct r5l_log {
>  
>  	/* for r5c_cache */
>  	enum r5c_state r5c_state;
> +
> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
> +	struct list_head stripe_in_cache_list;
> +
> +	spinlock_t stripe_in_cache_lock;

You use the _irqsave / _irqrestore versions of spinlock for this lock,
but I cannot see where it is takes from interrupt-context.
What did I miss?

> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
> +static inline void r5c_update_log_state(struct r5l_log *log)
> +{
> +	struct r5conf *conf = log->rdev->mddev->private;
> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
> +						log->last_checkpoint);
> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
> +
> +	if (!r5c_is_writeback(log))
> +		return;
> +	else if (free_space < 2 * reclaim_space) {
> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space < 3 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

Hmmm...
We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space

Would the code be clearer if you just made that explicit.

At the very least, change
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

to

> +	} else if (free_space < 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +     }

so the order of the branches is consistent and all the tests a '<'.

>  
> +/*
> + * do not release a stripe to cached lists in quiesce
> + */

You tell me what this function doesn't do, but not what it does.

This function is only called once, and I think that could would be
clearer if this was just included in-line in that one place.

> +void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh)
> +{
> +	if (!test_bit(STRIPE_HANDLE, &sh->state) &&
> +	    atomic_read(&sh->dev_in_cache) != 0) {
> +		if (!test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +			r5c_freeze_stripe_for_reclaim(sh);
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +	}
> +}
> +
>  
> @@ -709,6 +867,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
>  	while (!list_empty(&log->no_space_stripes)) {
>  		sh = list_first_entry(&log->no_space_stripes,
>  				      struct stripe_head, log_list);
> +		set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
>  		list_del_init(&sh->log_list);
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);

This chunk doesn't seem to belong here.  It looks like it might be a
bugfix that is quite independent of the rest of the code.  Is it?
If is, it probably belongs in a separate patch.


>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
> +	struct r5conf *conf = log->rdev->mddev->private;
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
>  	sector_t reclaimable;
>  	sector_t next_checkpoint;
> -	u64 next_cp_seq;
>  
>  	spin_lock_irq(&log->io_list_lock);
>  	/*
> @@ -924,8 +1115,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  				    log->io_list_lock);
>  	}
>  
> -	next_checkpoint = log->next_checkpoint;
> -	next_cp_seq = log->next_cp_seq;
> +	next_checkpoint = r5c_calculate_new_cp(conf);
>  	spin_unlock_irq(&log->io_list_lock);
>  
>  	BUG_ON(reclaimable < 0);
> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = next_checkpoint;
> -	log->last_cp_seq = next_cp_seq;

Why don't we update last_cp_seq here any more?


> +
> +/*
> + * if num < 0, flush all stripes
> + * if num == 0, flush all full stripes
> + * if num > 0, flush all full stripes. If less than num full stripes are
> + *             flushed, flush some partial stripes until totally num stripes are
> + *             flushed or there is no more cached stripes.
> + */

I'm not very keen on the idea of having "num < 0".

I'd rather the meaning was:
  flush all full stripes, and a total of at least num stripes.

To flush all stripes, pass MAX_INT for num (or similar).

> +void r5c_flush_cache(struct r5conf *conf, int num)
> +{
> +	int count;
> +	struct stripe_head *sh, *next;
> +
> +	assert_spin_locked(&conf->device_lock);
> +	if (!conf->log)
> +		return;
> +
> +	count = 0;
> +	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		count++;
> +	}
> +
> +	if (num == 0 || (num > 0 && count >= num))
> +		return;
> +	list_for_each_entry_safe(sh, next,
> +				 &conf->r5c_partial_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		if (num > 0 && count++ >= num)
> +			break;
> +	}
> +}
> +


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
From: NeilBrown @ 2016-10-19  0:53 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid@vger.kernel.org, Shaohua Li, Kernel Team,
	dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuang521@gmail.com, liuzhengyuan@kylinos.cn
In-Reply-To: <00D21118-73D2-42D1-8062-D468D0C134B4@fb.com>

[-- Attachment #1: Type: text/plain, Size: 4061 bytes --]

On Fri, Oct 14 2016, Song Liu wrote:

>> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Thu, Oct 13 2016, Song Liu wrote:
>>> 
>>> For RMW, the code allocates an extra page for each data block
>>> being updated.  This is stored in r5dev->page and the old data
>>> is read into it.  Then the prexor calculation subtracts ->page
>>> from the parity block, and the reconstruct calculation adds the
>>> ->orig_page data back into the parity block.
>> 
>> What happens if the alloc_page() fails?
>
> That will be tough, but solvable.. We can
>     read old data to page
>     do prexor 
>     read new data from journal device to page
>     do xor 
>     do the rest of the work. 
>
> Or we can force the code to rcw, which does not need extra page. 
> But rcw, does not always work in degraded mode. So, this is a good 
> reason not to do write-back in degraded mode...

Prohibiting write-back in degraded mode would not be enough to ensure
that you can always use rcw.  The array can become degraded after you
make the decision to use caching, and before to need to read old data
for rmw.

I would suggest a small (2 entry?) mempool where each entry in the
mempool holds enough pages to complete an rmw.  Only use the mempool if
an alloc_page() fails.

>> 
>>> +
>>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = sh->disks; i--; ) {
>>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>>> +		    sh->dev[i].written) {
>> 
>> Is it possible for R5_InCache to be set, but 'written' to be NULL ???
>
> Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
> times before parity calculation. When it comes here the second time, dev written in the 
> first time will have R5_InCache set, but its written will be NULL. 

OK, that makes sense.
So is it possible for 'written' to be set, but R5_InCache to be clear?
i.e. do we really need to test R5_InCache here?

>>> 
>>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>> 	io = log->current_io;
>>> 
>>> 	for (i = 0; i < sh->disks; i++) {
>>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>>> 			continue;
>> 
>> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
>> that were destined for the journal, then this would just be
>> 
>> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
>> 
>> which would make lots of sense...  Just a thought.
>
> We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
> those places aware of journal. I guess that is not ideal either? 

Maybe...
We have so many state flags that I like to be very cautious about adding
more, and to make sure they have a very well defined meaning that
doesn't overlap with other flags too much.
The above code suggests that Wantwrite and Wantcache overlap to some
extent.

Could we discard Wantcache and just use Wantwrite combined with InCache?
Wantwrite means that the block needed to be written to the RAID.
If InCache isn't set, it also needs to be written to the cache (if the
cache is being used).
Then the above code would be
   if (!test_bit(R5_Wantwrite) || test_bit(R5_InCache))
      continue;

which means "if we don't want to write this, or if it is already in the
cache, then nothing to do here".

Maybe.

>
>  
>> 
>>> }
>>> 
>>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> 
>> Why are you removing the 'static' here?  You don't call it from any
>> other file.
>
> In next patch, it is called in raid.c.

So remove 'static' in the next patch please.


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply

* Re: MD-RAID: Use seq_putc() in three status functions?
From: SF Markus Elfring @ 2016-10-18 18:20 UTC (permalink / raw)
  To: Hannes Reinecke, linux-raid
  Cc: Bernd Petrovitsch, Christoph Hellwig, Guoqing Jiang, Jens Axboe,
	Joe Perches, Mike Christie, Neil Brown, Shaohua Li,
	Tomasz Majchrzak, LKML, kernel-janitors, kbuild-all, ltp
In-Reply-To: <665cd40a-4562-a015-78c6-12976c12b626@suse.de>

> So back to the original task for you: Show me in the generated output where the benefits are.

I can offer a bit more information for this software development discussion.


The afffected source files can be compiled for the processor architecture "x86_64"
by a tool like "GCC 6.2.1+r239849-1.3" from the software distribution
"openSUSE Tumbleweed" with the following command examples.

git checkout next-20161014 && my_cc=/usr/bin/gcc-6 && make HOSTCC="${my_cc}" allmodconfig && make -j6 HOSTCC="${my_cc}" drivers/md/
git checkout next_usage_of_seq_putc_in_md_raid_1 && my_cc=/usr/bin/gcc-6 && make HOSTCC="${my_cc}" allmodconfig && make -j6 HOSTCC="${my_cc}" drivers/md/


The tool "objdump" from the software package "binutils 2.27-1.3" can be used
to get corresponding disassemblies for a file like "drivers/md/raid1.obj"
which can then be compared as follows.


--- ../disassembly-md-raid1-next-20161014-1.txt	2016-10-18 18:00:12.341222741 +0200
+++ ../disassembly-md-raid1-seq_putc-1.txt	2016-10-18 18:03:54.135887333 +0200
@@ -3349,7 +3349,7 @@
     37ad:	85 c0                	test   %eax,%eax
     37af:	74 0d                	je     37be <raid1_status+0x9e>
     37b1:	80 3d 00 00 00 00 00 	cmpb   $0x0,0x0(%rip)        # 37b8 <raid1_status+0x98>
-    37b8:	0f 84 1d 01 00 00    	je     38db <raid1_status+0x1bb>
+    37b8:	0f 84 1b 01 00 00    	je     38d9 <raid1_status+0x1b9>
     37be:	4c 89 ff             	mov    %r15,%rdi
     37c1:	31 db                	xor    %ebx,%ebx
     37c3:	e8 00 00 00 00       	callq  37c8 <raid1_status+0xa8>
@@ -3404,42 +3404,43 @@
     3891:	85 c0                	test   %eax,%eax
     3893:	74 09                	je     389e <raid1_status+0x17e>
     3895:	80 3d 00 00 00 00 00 	cmpb   $0x0,0x0(%rip)        # 389c <raid1_status+0x17c>
-    389c:	74 6e                	je     390c <raid1_status+0x1ec>
+    389c:	74 6c                	je     390a <raid1_status+0x1ea>
     389e:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
     38a5:	be 01 00 00 00       	mov    $0x1,%esi
     38aa:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
     38b1:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 38b8 <raid1_status+0x198>
     38b8:	e8 00 00 00 00       	callq  38bd <raid1_status+0x19d>
     38bd:	4c 89 f7             	mov    %r14,%rdi
-    38c0:	48 c7 c6 00 00 00 00 	mov    $0x0,%rsi
-    38c7:	e8 00 00 00 00       	callq  38cc <raid1_status+0x1ac>
-    38cc:	48 83 c4 10          	add    $0x10,%rsp
-    38d0:	5b                   	pop    %rbx
-    38d1:	41 5c                	pop    %r12
-    38d3:	41 5d                	pop    %r13
-    38d5:	41 5e                	pop    %r14
-    38d7:	41 5f                	pop    %r15
-    38d9:	5d                   	pop    %rbp
-    38da:	c3                   	retq   
-    38db:	e8 00 00 00 00       	callq  38e0 <raid1_status+0x1c0>
-    38e0:	84 c0                	test   %al,%al
-    38e2:	0f 85 d6 fe ff ff    	jne    37be <raid1_status+0x9e>
-    38e8:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
-    38ef:	be 69 03 00 00       	mov    $0x369,%esi
-    38f4:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
-    38fb:	c6 05 00 00 00 00 01 	movb   $0x1,0x0(%rip)        # 3902 <raid1_status+0x1e2>
-    3902:	e8 00 00 00 00       	callq  3907 <raid1_status+0x1e7>
-    3907:	e9 b2 fe ff ff       	jmpq   37be <raid1_status+0x9e>
-    390c:	e8 00 00 00 00       	callq  3911 <raid1_status+0x1f1>
-    3911:	84 c0                	test   %al,%al
-    3913:	75 89                	jne    389e <raid1_status+0x17e>
-    3915:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
-    391c:	be 9c 03 00 00       	mov    $0x39c,%esi
-    3921:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
-    3928:	c6 05 00 00 00 00 01 	movb   $0x1,0x0(%rip)        # 392f <raid1_status+0x20f>
-    392f:	e8 00 00 00 00       	callq  3934 <raid1_status+0x214>
-    3934:	e9 65 ff ff ff       	jmpq   389e <raid1_status+0x17e>
-    3939:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
+    38c0:	be 5d 00 00 00       	mov    $0x5d,%esi
+    38c5:	e8 00 00 00 00       	callq  38ca <raid1_status+0x1aa>
+    38ca:	48 83 c4 10          	add    $0x10,%rsp
+    38ce:	5b                   	pop    %rbx
+    38cf:	41 5c                	pop    %r12
+    38d1:	41 5d                	pop    %r13
+    38d3:	41 5e                	pop    %r14
+    38d5:	41 5f                	pop    %r15
+    38d7:	5d                   	pop    %rbp
+    38d8:	c3                   	retq   
+    38d9:	e8 00 00 00 00       	callq  38de <raid1_status+0x1be>
+    38de:	84 c0                	test   %al,%al
+    38e0:	0f 85 d8 fe ff ff    	jne    37be <raid1_status+0x9e>
+    38e6:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
+    38ed:	be 69 03 00 00       	mov    $0x369,%esi
+    38f2:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
+    38f9:	c6 05 00 00 00 00 01 	movb   $0x1,0x0(%rip)        # 3900 <raid1_status+0x1e0>
+    3900:	e8 00 00 00 00       	callq  3905 <raid1_status+0x1e5>
+    3905:	e9 b4 fe ff ff       	jmpq   37be <raid1_status+0x9e>
+    390a:	e8 00 00 00 00       	callq  390f <raid1_status+0x1ef>
+    390f:	84 c0                	test   %al,%al
+    3911:	75 8b                	jne    389e <raid1_status+0x17e>
+    3913:	48 c7 c2 00 00 00 00 	mov    $0x0,%rdx
+    391a:	be 9c 03 00 00       	mov    $0x39c,%esi
+    391f:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
+    3926:	c6 05 00 00 00 00 01 	movb   $0x1,0x0(%rip)        # 392d <raid1_status+0x20d>
+    392d:	e8 00 00 00 00       	callq  3932 <raid1_status+0x212>
+    3932:	e9 67 ff ff ff       	jmpq   389e <raid1_status+0x17e>
+    3937:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
+    393e:	00 00 
 
 0000000000003940 <print_conf>:
     3940:	e8 00 00 00 00       	callq  3945 <print_conf+0x5>
@@ -11134,7 +11135,7 @@
 
 0000000000000000 <_GLOBAL__sub_D_65535_0_raid1.c>:
    0:	55                   	push   %rbp
-   1:	be 35 00 00 00       	mov    $0x35,%esi
+   1:	be 34 00 00 00       	mov    $0x34,%esi
    6:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    d:	48 89 e5             	mov    %rsp,%rbp
   10:	e8 00 00 00 00       	callq  15 <_GLOBAL__sub_D_65535_0_raid1.c+0x15>
@@ -11145,7 +11146,7 @@
 
 0000000000000000 <_GLOBAL__sub_I_65535_1_raid1.c>:
    0:	55                   	push   %rbp
-   1:	be 35 00 00 00       	mov    $0x35,%esi
+   1:	be 34 00 00 00       	mov    $0x34,%esi
    6:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
    d:	48 89 e5             	mov    %rsp,%rbp
   10:	e8 00 00 00 00       	callq  15 <_GLOBAL__sub_I_65535_1_raid1.c+0x15>


Does this kind of data display contain differences which are worth for further considerations?

Regards,
Markus

^ permalink raw reply

* Re: dm: do not assign error to md->kworker_task
From: Mike Snitzer @ 2016-10-18 18:11 UTC (permalink / raw)
  To: Tahsin Erdogan
  Cc: Alasdair Kergon, dm-devel, Shaohua Li, linux-raid, linux-kernel
In-Reply-To: <1476753409-16111-1-git-send-email-tahsin@google.com>

On Mon, Oct 17 2016 at  9:16pm -0400,
Tahsin Erdogan <tahsin@google.com> wrote:

> cleanup_mapped_device() calls kthread_stop() if kworker_task is
> non-NULL. Currently the assigned value could be a valid task struct or
> an error code. Do not assign in case of error.
> 
> Example failure when kthread_run() returns -ENOMEM:
> 
> [   22.255939] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> [   22.258847] IP: [<ffffffff802973a4>] kthread_stop+0x34/0x260
> [   22.260130] PGD 78a23067 PUD 78b56067 PMD 0
> [   22.260130] Oops: 0002 [#1] SMP
> [   22.260130] CPU: 1 PID: 1849 Comm: dmsetup Tainted: G        W 4.8.0+ #3
> [   22.260130] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [   22.260130] task: ffff880078966400 task.stack: ffffc90001898000
> [   22.260130] RIP: 0010:[<ffffffff802973a4>]  [<ffffffff802973a4>] kthread_stop+0x34/0x260
> [   22.260130] RSP: 0018:ffffc9000189bc40  EFLAGS: 00010202
> [   22.260130] RAX: 0000000000000001 RBX: fffffffffffffff4 RCX: 0000000000000003
> [   22.260130] RDX: ffff88007fd18600 RSI: 0000000000000001 RDI: ffffffff81037080
> [   22.260130] RBP: ffffc9000189bc50 R08: 0000000000000000 R09: 0000000000000000
> [   22.260130] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [   22.260130] R13: 0000000000000001 R14: ffff880077f539d8 R15: 0000000000000004
> [   22.260130] FS:  00007fc9ef2e2840(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
> [   22.260130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   22.260130] CR2: 000000000000000c CR3: 0000000077fa2000 CR4: 00000000000006e0
> [   22.260130] Stack:
> [   22.260130]  ffff880077f53800 0000000000000000 ffffc9000189bc68 ffffffff808b26fa
> [   22.260130]  ffff880077f53800 ffffc9000189bcb0 ffffffff808b3c58 0000000000000000
> [   22.260130]  00000000808b534b ffffc9000189bd20 ffff880077f53800 0000000000000000
> [   22.260130] Call Trace:
> [   22.260130]  [<ffffffff808b26fa>] cleanup_mapped_device+0x2a/0xe0
> [   22.260130]  [<ffffffff808b3c58>] __dm_destroy+0x1a8/0x2b0
> [   22.260130]  [<ffffffff808b4b6e>] dm_destroy+0xe/0x10
> [   22.260130]  [<ffffffff808b9f49>] dev_remove+0xd9/0x120
> [   22.260130]  [<ffffffff808b9e70>] ? dev_suspend+0x210/0x210
> [   22.260130]  [<ffffffff808ba576>] ctl_ioctl+0x206/0x500
> [   22.260130]  [<ffffffff808ba87e>] dm_ctl_ioctl+0xe/0x20
> [   22.260130]  [<ffffffff803bca40>] do_vfs_ioctl+0x90/0x6b0
> [   22.260130]  [<ffffffff80b11fd7>] ?  entry_SYSCALL_64_fastpath+0x5/0xad
> [   22.260130]  [<ffffffff802bd974>] ?  trace_hardirqs_on_caller+0xf4/0x1c0
> [   22.260130]  [<ffffffff803bd0d4>] SyS_ioctl+0x74/0x80
> [   22.260130]  [<ffffffff80b11fea>] entry_SYSCALL_64_fastpath+0x18/0xad
> [   22.260130] Code: e5 41 54 85 c0 53 48 89 fb 0f 8f bb 01 00 00 65 8b
> 05 a1 2d d7 7f 89 c0 48 0f a3 05 9f 94 e8 00 0f 92 c0 84 c0 0f 85 a3 00
> 00 00 <f0> ff 43 18 48 89 df e8 10 f8 ff ff 48 85 c0 49 89 c4 74 29 f0
> [   22.260130] RIP  [<ffffffff802973a4>] kthread_stop+0x34/0x260
> [   22.260130]  RSP <ffffc9000189bc40>
> [   22.260130] CR2: 000000000000000c
> [   22.301062] ---[ end trace 22b4f4f62c04f3cf ]---
> 
> Signed-off-by: Tahsin Erdogan <tahsin@google.com>

Thanks for the patch but I elected to fix this issue a slightly
different way, please see this commit staged for 4.9-rcX:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.9&id=937fa62e8a00d0b4bc2c0a40567d7c88ab2b2e8d

(also, your mail reminds me that I _really_ need to fix
get_maintainer.pl to _not_ pull in linux-raid and shli for DM-specific
changes!)

^ permalink raw reply

* [PATCH 3/3 v2] md: unblock array if bad blocks have been acknowledged
From: Tomasz Majchrzak @ 2016-10-18 14:10 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, aleksey.obitotskiy, pawel.baldysiak, artur.paszkiewicz,
	maksymilian.kunt, mariusz.dabrowski, axboe, dan.j.williams,
	linux-block, Tomasz Majchrzak

Once external metadata handler acknowledges all bad blocks (by writing
to rdev 'bad_blocks' sysfs file), it requests to unblock the array.
Check if all bad blocks are actually acknowledged as there might be a
race if new bad blocks are notified at the same time. If all bad blocks
are acknowledged, just unblock the array and continue. If not, ignore
the request to unblock (do not fail an array). External metadata handler
is expected to either process remaining bad blocks and try to unblock
again or remove bad block support for a disk (which will cause disk to
fail as in no-support case).

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cc05236..ce585b7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2612,19 +2612,29 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		set_bit(Blocked, &rdev->flags);
 		err = 0;
 	} else if (cmd_match(buf, "-blocked")) {
-		if (!test_bit(Faulty, &rdev->flags) &&
+		int unblock = 1;
+		int acked = !rdev->badblocks.unacked_exist;
+
+		if ((test_bit(ExternalBbl, &rdev->flags) &&
+		     rdev->badblocks.changed))
+			acked = check_if_badblocks_acked(&rdev->badblocks);
+
+		if (test_bit(ExternalBbl, &rdev->flags) && !acked) {
+			unblock = 0;
+		} else if (!test_bit(Faulty, &rdev->flags) &&
 		    rdev->badblocks.unacked_exist) {
 			/* metadata handler doesn't understand badblocks,
 			 * so we need to fail the device
 			 */
 			md_error(rdev->mddev, rdev);
 		}
-		clear_bit(Blocked, &rdev->flags);
-		clear_bit(BlockedBadBlocks, &rdev->flags);
-		wake_up(&rdev->blocked_wait);
-		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
-		md_wakeup_thread(rdev->mddev->thread);
-
+		if (unblock) {
+			clear_bit(Blocked, &rdev->flags);
+			clear_bit(BlockedBadBlocks, &rdev->flags);
+			wake_up(&rdev->blocked_wait);
+			set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+			md_wakeup_thread(rdev->mddev->thread);
+		}
 		err = 0;
 	} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
 		set_bit(In_sync, &rdev->flags);
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 2/3 v2] md: add bad block support for external metadata
From: Tomasz Majchrzak @ 2016-10-18 14:09 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, aleksey.obitotskiy, pawel.baldysiak, artur.paszkiewicz,
	maksymilian.kunt, mariusz.dabrowski, axboe, dan.j.williams,
	linux-block, Tomasz Majchrzak

Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c | 78 ++++++++++++++++++++++++++++-----------------------------
 drivers/md/md.h |  3 +++
 2 files changed, 42 insertions(+), 39 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index eac84d8..cc05236 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2523,51 +2523,38 @@ struct rdev_sysfs_entry {
 static ssize_t
 state_show(struct md_rdev *rdev, char *page)
 {
-	char *sep = "";
+	char *sep = ",";
 	size_t len = 0;
 	unsigned long flags = ACCESS_ONCE(rdev->flags);
 
 	if (test_bit(Faulty, &flags) ||
-	    rdev->badblocks.unacked_exist) {
-		len+= sprintf(page+len, "%sfaulty",sep);
-		sep = ",";
-	}
-	if (test_bit(In_sync, &flags)) {
-		len += sprintf(page+len, "%sin_sync",sep);
-		sep = ",";
-	}
-	if (test_bit(Journal, &flags)) {
-		len += sprintf(page+len, "%sjournal",sep);
-		sep = ",";
-	}
-	if (test_bit(WriteMostly, &flags)) {
-		len += sprintf(page+len, "%swrite_mostly",sep);
-		sep = ",";
-	}
+	    rdev->badblocks.unacked_exist)
+		len += sprintf(page+len, "faulty%s", sep);
+	if (test_bit(In_sync, &flags))
+		len += sprintf(page+len, "in_sync%s", sep);
+	if (test_bit(Journal, &flags))
+		len += sprintf(page+len, "journal%s", sep);
+	if (test_bit(WriteMostly, &flags))
+		len += sprintf(page+len, "write_mostly%s", sep);
 	if (test_bit(Blocked, &flags) ||
 	    (rdev->badblocks.unacked_exist
-	     && !test_bit(Faulty, &flags))) {
-		len += sprintf(page+len, "%sblocked", sep);
-		sep = ",";
-	}
+	     && !test_bit(Faulty, &flags)))
+		len += sprintf(page+len, "blocked%s", sep);
 	if (!test_bit(Faulty, &flags) &&
 	    !test_bit(Journal, &flags) &&
-	    !test_bit(In_sync, &flags)) {
-		len += sprintf(page+len, "%sspare", sep);
-		sep = ",";
-	}
-	if (test_bit(WriteErrorSeen, &flags)) {
-		len += sprintf(page+len, "%swrite_error", sep);
-		sep = ",";
-	}
-	if (test_bit(WantReplacement, &flags)) {
-		len += sprintf(page+len, "%swant_replacement", sep);
-		sep = ",";
-	}
-	if (test_bit(Replacement, &flags)) {
-		len += sprintf(page+len, "%sreplacement", sep);
-		sep = ",";
-	}
+	    !test_bit(In_sync, &flags))
+		len += sprintf(page+len, "spare%s", sep);
+	if (test_bit(WriteErrorSeen, &flags))
+		len += sprintf(page+len, "write_error%s", sep);
+	if (test_bit(WantReplacement, &flags))
+		len += sprintf(page+len, "want_replacement%s", sep);
+	if (test_bit(Replacement, &flags))
+		len += sprintf(page+len, "replacement%s", sep);
+	if (test_bit(ExternalBbl, &flags))
+		len += sprintf(page+len, "external_bbl%s", sep);
+
+	if (len)
+		len -= strlen(sep);
 
 	return len+sprintf(page+len, "\n");
 }
@@ -2708,6 +2695,13 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 			}
 		} else
 			err = -EBUSY;
+	} else if (cmd_match(buf, "external_bbl") && (rdev->mddev->external)) {
+		set_bit(ExternalBbl, &rdev->flags);
+		rdev->badblocks.shift = 0;
+		err = 0;
+	} else if (cmd_match(buf, "-external_bbl") && (rdev->mddev->external)) {
+		clear_bit(ExternalBbl, &rdev->flags);
+		err = 0;
 	}
 	if (!err)
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
@@ -8614,6 +8608,9 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	rv = badblocks_set(&rdev->badblocks, s, sectors, 0);
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
+		if (test_bit(ExternalBbl, &rdev->flags))
+			sysfs_notify(&rdev->kobj, NULL,
+				     "unacknowledged_bad_blocks");
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
 		set_mask_bits(&mddev->flags, 0,
 			      BIT(MD_CHANGE_CLEAN) | BIT(MD_CHANGE_PENDING));
@@ -8627,12 +8624,15 @@ EXPORT_SYMBOL_GPL(rdev_set_badblocks);
 int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 			 int is_new)
 {
+	int rv;
 	if (is_new)
 		s += rdev->new_data_offset;
 	else
 		s += rdev->data_offset;
-	return badblocks_clear(&rdev->badblocks,
-				  s, sectors);
+	rv = badblocks_clear(&rdev->badblocks, s, sectors);
+	if ((rv == 0) && test_bit(ExternalBbl, &rdev->flags))
+		sysfs_notify(&rdev->kobj, NULL, "bad_blocks");
+	return rv;
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2b20417..21bd94f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -168,6 +168,9 @@ enum flag_bits {
 				 * so it is safe to remove without
 				 * another synchronize_rcu() call.
 				 */
+	ExternalBbl,            /* External metadata provides bad
+				 * block management for a disk
+				 */
 };
 
 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH 1/3 v2] badblocks: add function to check for unacked badblocks
From: Tomasz Majchrzak @ 2016-10-18 14:08 UTC (permalink / raw)
  To: linux-block
  Cc: shli, aleksey.obitotskiy, pawel.baldysiak, artur.paszkiewicz,
	maksymilian.kunt, mariusz.dabrowski, axboe, dan.j.williams,
	linux-raid, Tomasz Majchrzak

Add new function that checks if unacknowledged badblocks exist in a list
and clears ->unacked_exists flag if possible. It is required for storing
badblock information by userspace metadata handlers. Similar function
ack_all_badblocks is of no use in this case as userspace reads badblocks
data from sysfs file so it cannot be sure if the list available in sysfs
at the moment is complete (potential race with other badblocks reported
at the same time).

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 block/badblocks.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/badblocks.h |  1 +
 2 files changed, 37 insertions(+)

diff --git a/block/badblocks.c b/block/badblocks.c
index 7be53cb..1f4a193 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -439,6 +439,42 @@ void ack_all_badblocks(struct badblocks *bb)
 EXPORT_SYMBOL_GPL(ack_all_badblocks);
 
 /**
+ * check_if_badblocks_acked() - Check if all badblocks in list are acknowledged
+ * @bb:		the badblocks structure that holds all badblock information
+ *
+ * It clears ->changed and ->unacked_exist if successful. It is used by
+ * userspace metadata updates
+ *
+ *  Return:
+ *   True if all badblocks are acknowledged, false otherwise
+ */
+int check_if_badblocks_acked(struct badblocks *bb)
+{
+	int acked = 1;
+
+	write_seqlock_irq(&bb->lock);
+	if (bb->unacked_exist) {
+		u64 *p = bb->page;
+		int i;
+
+		for (i = 0; i < bb->count ; i++) {
+			if (!BB_ACK(p[i])) {
+				acked = 0;
+				break;
+			}
+		}
+		if (acked) {
+			bb->unacked_exist = 0;
+			bb->changed = 0;
+		}
+	}
+	write_sequnlock_irq(&bb->lock);
+
+	return acked;
+}
+EXPORT_SYMBOL_GPL(check_if_badblocks_acked);
+
+/**
  * badblocks_show() - sysfs access to bad-blocks list
  * @bb:		the badblocks structure that holds all badblock information
  * @page:	buffer received from sysfs
diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
index c3bdf8c..6569d4e 100644
--- a/include/linux/badblocks.h
+++ b/include/linux/badblocks.h
@@ -46,6 +46,7 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
 			int acknowledged);
 int badblocks_clear(struct badblocks *bb, sector_t s, int sectors);
 void ack_all_badblocks(struct badblocks *bb);
+int check_if_badblocks_acked(struct badblocks *bb);
 ssize_t badblocks_show(struct badblocks *bb, char *page, int unack);
 ssize_t badblocks_store(struct badblocks *bb, const char *page, size_t len,
 			int unack);
-- 
1.8.3.1


^ permalink raw reply related

* Re: Superblock of raid5 log can't be updated when array stoped
From: Zhengyuan Liu @ 2016-10-18  5:51 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, Song Liu, linux-raid
In-Reply-To: <20161018002814.GC106864@kernel.org>

If that's the problem, I think there is another problem about next_checkpoint initialization.
No initial operation was done to this field when we loaded/recovery  the log , it got
assignment  only when IO to raid disk was finished.  So r5l_quiesce may use wrong 
next_checkpoint to reclaim log space, that would confuse log recovery.  Bellow patch
may help the expression.

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1b1ab4a..998ea00 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1096,6 +1096,8 @@ static int r5l_recovery_log(struct r5l_log *log)
                log->seq = ctx.seq + 11;
                log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                r5l_write_super(log, ctx.pos);
+               log->last_checkpoint = ctx.pos;
+               log->next_checkpoint = ctx.pos;
        } else {
                log->log_start = ctx.pos;
                log->seq = ctx.seq;
@@ -1168,6 +1170,7 @@ create:
        if (log->max_free_space > RECLAIM_MAX_FREE_SPACE)
                log->max_free_space = RECLAIM_MAX_FREE_SPACE;
        log->last_checkpoint = cp;
+       log->next_checkpoint = cp;
 
        __free_page(page);

Thanks,
--Zhengyuan

------------------ Original ------------------
From:  "Shaohua Li"<shli@kernel.org>;
Date:  Tue, Oct 18, 2016 08:28 AM
To:  "Zhengyuan Liu"<liuzhengyuan@kylinos.cn>;
Cc:  "shli"<shli@fb.com>; "Song Liu"<songliubraving@fb.com>; "linux-raid"<linux-raid@vger.kernel.org>;
Subject:  Re: Superblock of raid5 log can't be updated when array stoped
 
On Sat, Oct 15, 2016 at 10:19:36AM +0800, liuzhengyuan wrote:
> Hi, Shaohua.
> 
> when we stop raid5 array with "mdadm -S" or reboot the system,  md module will 
> call raid5_quiesce and r5l_quiesce to do some clean work. Some code of r5l_quiesce
> was pasted bellow.
> 
>                 /* make sure r5l_write_super_and_discard_space exits */
>                 mddev = log->rdev->mddev;
>                 wake_up(&mddev->sb_wait);
>                 r5l_wake_reclaim(log, -1L);
>                 md_unregister_thread(&log->reclaim_thread);
>                 r5l_do_reclaim(log);
> +                md_update_sb(mddev, 1);
> 
> It will reclaim all used space of log and call r5l_write_super to reset rdev->journal_tail
> to log->next_checkpoint . However, new rdev->journal_tail would not be written to 
> journal device for persistent because journal device may not support discard operation
> or due to mddev_trylock fail (this trylock should always get failed since raid5_quiesce 
> was called with  reconfig_mutex hold, isn't it?). As a result, it will take a long time to
>  recovery the log when the arrary was restarted. Should r5l_quiesce call md_update_sb
>  directly to guarantee  superblock  update?

Yep, that's problem here. Unfortunately we can't call md_update_sb here,
because we might not hold the mddev lock. I think we should call it at do_md_stop.

Thanks,
Shaohua

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox