Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH] dm ioctl: Remove double parentheses
From: Joe Perches @ 2017-04-01  2:07 UTC (permalink / raw)
  To: Matthias Kaehlcke, Alasdair Kergon, Mike Snitzer, dm-devel
  Cc: Michael Davidson, Grant Grundler, linux-kernel, Greg Hackmann,
	linux-raid, Shaohua Li
In-Reply-To: <20170401015019.GA13986@google.com>

On Fri, 2017-03-31 at 18:50 -0700, Matthias Kaehlcke wrote:
> El Thu, Mar 16, 2017 at 09:48:30AM -0700 Matthias Kaehlcke ha dit:
> 
> > The extra pair of parantheses is not needed and causes clang to generate
> > the following warning:
> > 
> > drivers/md/dm-ioctl.c:1776:11: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
> >         if ((cmd == DM_DEV_CREATE_CMD)) {
> >              ~~~~^~~~~~~~~~~~~~~~~~~~
> > drivers/md/dm-ioctl.c:1776:11: note: remove extraneous parentheses around the comparison to silence this warning
> >         if ((cmd == DM_DEV_CREATE_CMD)) {
> >             ~    ^                   ~
> > drivers/md/dm-ioctl.c:1776:11: note: use '=' to turn this equality comparison into an assignment
> >         if ((cmd == DM_DEV_CREATE_CMD)) {

There are dozens of these comparisons in the kernel.
Are you fixing them all or just this one?

^ permalink raw reply

* [PATCH] md/raid10: remove unnecessary checks with 'first'
From: Guoqing Jiang @ 2017-04-01  3:38 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb

Since the value of first is always '1', we can
set min_offset_diff directly.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid10.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0f13d57..edaf8f4 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3702,7 +3702,6 @@ static int raid10_run(struct mddev *mddev)
 	struct md_rdev *rdev;
 	sector_t size;
 	sector_t min_offset_diff = 0;
-	int first = 1;
 	bool discard_supported = false;
 
 	if (mddev->private == NULL) {
@@ -3758,8 +3757,7 @@ static int raid10_run(struct mddev *mddev)
 			diff = -diff;
 		if (diff < 0)
 			diff = 0;
-		if (first || diff < min_offset_diff)
-			min_offset_diff = diff;
+		min_offset_diff = diff;
 
 		if (mddev->gendisk)
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
@@ -4140,7 +4138,6 @@ static int raid10_start_reshape(struct mddev *mddev)
 
 	unsigned long before_length, after_length;
 	sector_t min_offset_diff = 0;
-	int first = 1;
 	struct geom new;
 	struct r10conf *conf = mddev->private;
 	struct md_rdev *rdev;
@@ -4169,8 +4166,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 				diff = -diff;
 			if (diff < 0)
 				diff = 0;
-			if (first || diff < min_offset_diff)
-				min_offset_diff = diff;
+			min_offset_diff = diff;
 		}
 	}
 
-- 
2.6.6


^ permalink raw reply related

* [PATCH 0/4] integrated stat and fstat into utility functions
From: Zhilong Liu @ 2017-04-01 12:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu

hi, Jes;

This patches focus on integrating stat and fstat operations, move the
duplicated codes into one utility function, and also fixed one bug in
waitclean(), it forgot to check the block device.

Thanks,
Zhilong

Zhilong Liu (4):
  mdadm/util:integrate stat operations into one utility
  mdadm/util:integrate fstat operations into one utility function
  mdadm/Create:declaring an existing struct within same function
  mdadm/Monitor:check the block device when use waitclean parameter

 Assemble.c    | 15 ++++-----------
 Build.c       | 21 +++------------------
 Create.c      | 17 +++++++----------
 Grow.c        |  4 +---
 Incremental.c | 25 ++-----------------------
 Manage.c      | 10 ++--------
 Monitor.c     |  2 ++
 bitmap.c      | 12 ++++--------
 mdadm.h       |  2 ++
 super-intel.c | 14 +++-----------
 util.c        | 30 ++++++++++++++++++++++++++++++
 11 files changed, 60 insertions(+), 92 deletions(-)

-- 
2.10.2


^ permalink raw reply

* [PATCH 1/4] mdadm/util:integrate stat operations into one utility
From: Zhilong Liu @ 2017-04-01 12:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170401125145.14440-1-zlliu@suse.com>

mdadm/util: there are dupilicate codes about stat checking the
block device, move the operations into one utility function and
make it concise.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Assemble.c    |  5 ++---
 Build.c       | 21 +++------------------
 Incremental.c | 12 +-----------
 Manage.c      | 10 ++--------
 mdadm.h       |  1 +
 super-intel.c |  4 +---
 util.c        | 15 +++++++++++++++
 7 files changed, 25 insertions(+), 43 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index 672cd12..c6571e6 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -522,9 +522,8 @@ static int select_devices(struct mddev_dev *devlist,
 		struct stat stb;
 		if (tmpdev->used != 3)
 			continue;
-		if (stat(tmpdev->devname, &stb)< 0) {
-			pr_err("fstat failed for %s: %s\n",
-			       tmpdev->devname, strerror(errno));
+		if (check_blkdev_via_stat(tmpdev->devname) &&
+		    stat(tmpdev->devname, &stb)) {
 			tmpdev->used = 2;
 		} else {
 			struct dev_policy *pol = devid_policy(stb.st_rdev);
diff --git a/Build.c b/Build.c
index 691dd6f..b500ad7 100644
--- a/Build.c
+++ b/Build.c
@@ -67,16 +67,8 @@ int Build(char *mddev, struct mddev_dev *devlist,
 			missing_disks++;
 			continue;
 		}
-		if (stat(dv->devname, &stb)) {
-			pr_err("Cannot find %s: %s\n",
-				dv->devname, strerror(errno));
-			return 1;
-		}
-		if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			pr_err("%s is not a block device.\n",
-				dv->devname);
+		if (check_blkdev_via_stat(dv->devname))
 			return 1;
-		}
 	}
 
 	if (s->raiddisks != subdevs) {
@@ -171,16 +163,9 @@ int Build(char *mddev, struct mddev_dev *devlist,
 		int fd;
 		if (strcmp("missing", dv->devname) == 0)
 			continue;
-		if (stat(dv->devname, &stb)) {
-			pr_err("Weird: %s has disappeared.\n",
-				dv->devname);
+		if (check_blkdev_via_stat(dv->devname))
 			goto abort;
-		}
-		if ((stb.st_mode & S_IFMT)!= S_IFBLK) {
-			pr_err("Weird: %s is no longer a block device.\n",
-				dv->devname);
-			goto abort;
-		}
+		stat(dv->devname, &stb);
 		fd = open(dv->devname, O_RDONLY|O_EXCL);
 		if (fd < 0) {
 			pr_err("Cannot open %s: %s\n",
diff --git a/Incremental.c b/Incremental.c
index 28f1f77..78adf18 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -108,18 +108,8 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 	struct createinfo *ci = conf_get_create_info();
 
-	if (stat(devname, &stb) < 0) {
-		if (c->verbose >= 0)
-			pr_err("stat failed for %s: %s.\n",
-				devname, strerror(errno));
+	if (check_blkdev_via_stat(devname) && stat(devname, &stb))
 		return rv;
-	}
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		if (c->verbose >= 0)
-			pr_err("%s is not a block device.\n",
-				devname);
-		return rv;
-	}
 	dfd = dev_open(devname, O_RDONLY);
 	if (dfd < 0) {
 		if (c->verbose >= 0)
diff --git a/Manage.c b/Manage.c
index 618c98b..6aec65d 100644
--- a/Manage.c
+++ b/Manage.c
@@ -1544,17 +1544,11 @@ int Manage_subdevs(char *devname, int fd,
 				close(tfd);
 			} else {
 				int open_err = errno;
-				if (stat(dv->devname, &stb) != 0) {
-					pr_err("Cannot find %s: %s\n",
-					       dv->devname, strerror(errno));
-					goto abort;
-				}
-				if ((stb.st_mode & S_IFMT) != S_IFBLK) {
+				if (check_blkdev_via_stat(dv->devname) &&
+				    stat(dv->devname, &stb)) {
 					if (dv->disposition == 'M')
 						/* non-fatal. Also improbable */
 						continue;
-					pr_err("%s is not a block device.\n",
-					       dv->devname);
 					goto abort;
 				}
 				if (dv->disposition == 'r')
diff --git a/mdadm.h b/mdadm.h
index 612bd86..0aea822 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1422,6 +1422,7 @@ extern int check_raid(int fd, char *name);
 extern int check_partitions(int fd, char *dname,
 			    unsigned long long freesize,
 			    unsigned long long size);
+extern int check_blkdev_via_stat(char *dev);
 
 extern int get_mdp_major(void);
 extern int get_maj_min(char *dev, int *major, int *minor);
diff --git a/super-intel.c b/super-intel.c
index 84dfe2b..924ad8a 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -6953,9 +6953,7 @@ static int validate_geometry_imsm_volume(struct supertype *st, int level,
 	}
 
 	/* This device must be a member of the set */
-	if (stat(dev, &stb) < 0)
-		return 0;
-	if ((S_IFMT & stb.st_mode) != S_IFBLK)
+	if (check_blkdev_via_stat(dev) && stat(dev, &stb))
 		return 0;
 	for (dl = super->disks ; dl ; dl = dl->next) {
 		if (dl->major == (int)major(stb.st_rdev) &&
diff --git a/util.c b/util.c
index 9fc7ba0..03d4256 100644
--- a/util.c
+++ b/util.c
@@ -774,6 +774,21 @@ int ask(char *mesg)
 }
 #endif /* MDASSEMBLE */
 
+int check_blkdev_via_stat(char *dev)
+{
+	struct stat stb;
+
+	if (stat(dev, &stb) != 0) {
+		pr_err("stat failed for %s: %s\n", dev, strerror(errno));
+		return -1;
+	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return -1;
+	}
+	return 0;
+}
+
 int is_standard(char *dev, int *nump)
 {
 	/* tests if dev is a "standard" md dev name.
-- 
2.10.2


^ permalink raw reply related

* [PATCH 2/4] mdadm/util:integrate fstat operations into one utility function
From: Zhilong Liu @ 2017-04-01 12:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170401125145.14440-1-zlliu@suse.com>

mdadm/util: there are dupilicate codes about fstat checking the
block device, move the operations into one utility function and
make it concise.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Assemble.c    | 10 ++--------
 Create.c      |  5 +----
 Grow.c        |  4 +---
 Incremental.c | 13 +------------
 bitmap.c      | 12 ++++--------
 mdadm.h       |  1 +
 super-intel.c | 10 ++--------
 util.c        | 15 +++++++++++++++
 8 files changed, 27 insertions(+), 43 deletions(-)

diff --git a/Assemble.c b/Assemble.c
index c6571e6..47c6685 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -204,14 +204,8 @@ static int select_devices(struct mddev_dev *devlist,
 				pr_err("cannot open device %s: %s\n",
 				       devname, strerror(errno));
 			tmpdev->used = 2;
-		} else if (fstat(dfd, &stb)< 0) {
-			/* Impossible! */
-			pr_err("fstat failed for %s: %s\n",
-			       devname, strerror(errno));
-			tmpdev->used = 2;
-		} else if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			pr_err("%s is not a block device.\n",
-			       devname);
+		} else if (check_blkdev_via_fstat(dfd, devname) &&
+			   fstat(dfd, &stb)) {
 			tmpdev->used = 2;
 		} else if (must_be_container(dfd)) {
 			if (st) {
diff --git a/Create.c b/Create.c
index 32987af..e43c1d6 100644
--- a/Create.c
+++ b/Create.c
@@ -326,11 +326,8 @@ int Create(struct supertype *st, char *mddev,
 				dname, strerror(errno));
 			exit(2);
 		}
-		if (fstat(dfd, &stb) != 0 ||
-		    (stb.st_mode & S_IFMT) != S_IFBLK) {
+		if (check_blkdev_via_fstat(dfd, dname)) {
 			close(dfd);
-			pr_err("%s is not a block device\n",
-				dname);
 			exit(2);
 		}
 		close(dfd);
diff --git a/Grow.c b/Grow.c
index 78a3474..0085f91 100755
--- a/Grow.c
+++ b/Grow.c
@@ -145,9 +145,7 @@ int Grow_Add_device(char *devname, int fd, char *newdev)
 		free(st);
 		return 1;
 	}
-	fstat(nfd, &stb);
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		pr_err("%s is not a block device!\n", newdev);
+	if (check_blkdev_via_fstat(nfd, newdev) && fstat(nfd, &stb)) {
 		close(nfd);
 		free(st);
 		return 1;
diff --git a/Incremental.c b/Incremental.c
index 78adf18..19af045 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -164,19 +164,8 @@ int Incremental(struct mddev_dev *devlist, struct context *c,
 
 	/* 2/ Find metadata, reject if none appropriate (check
 	 *            version/name from args) */
-
-	if (fstat(dfd, &stb) < 0) {
-		if (c->verbose >= 0)
-			pr_err("fstat failed for %s: %s.\n",
-				devname, strerror(errno));
-		goto out;
-	}
-	if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-		if (c->verbose >= 0)
-			pr_err("%s is not a block device.\n",
-				devname);
+	if (check_blkdev_via_fstat(dfd, devname) && fstat(dfd, &stb))
 		goto out;
-	}
 
 	dinfo.disk.major = major(stb.st_rdev);
 	dinfo.disk.minor = minor(stb.st_rdev);
diff --git a/bitmap.c b/bitmap.c
index ccedfd3..69ce640 100644
--- a/bitmap.c
+++ b/bitmap.c
@@ -193,14 +193,7 @@ bitmap_file_open(char *filename, struct supertype **stp, int node_num)
 		return -1;
 	}
 
-	if (fstat(fd, &stb) < 0) {
-		pr_err("failed to determine bitmap file/device type: %s\n",
-			strerror(errno));
-		close(fd);
-		return -1;
-	}
-
-	if ((stb.st_mode & S_IFMT) == S_IFBLK) {
+	if (check_blkdev_via_fstat(fd, filename)) {
 		/* block device, so we are probably after an internal bitmap */
 		if (!st)
 			st = guess_super(fd);
@@ -221,6 +214,9 @@ bitmap_file_open(char *filename, struct supertype **stp, int node_num)
 		}
 
 		*stp = st;
+	} else {
+		close(fd);
+		return -1;
 	}
 
 	return fd;
diff --git a/mdadm.h b/mdadm.h
index 0aea822..e021149 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1423,6 +1423,7 @@ extern int check_partitions(int fd, char *dname,
 			    unsigned long long freesize,
 			    unsigned long long size);
 extern int check_blkdev_via_stat(char *dev);
+extern int check_blkdev_via_fstat(int fd, char *dev);
 
 extern int get_mdp_major(void);
 extern int get_maj_min(char *dev, int *major, int *minor);
diff --git a/super-intel.c b/super-intel.c
index 924ad8a..014616c 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -6604,14 +6604,8 @@ count_volumes_list(struct md_list *devlist, char *homehost,
 			dprintf("cannot open device %s: %s\n",
 				devname, strerror(errno));
 			tmpdev->used = 2;
-		} else if (fstat(dfd, &stb)< 0) {
-			/* Impossible! */
-			dprintf("fstat failed for %s: %s\n",
-				devname, strerror(errno));
-			tmpdev->used = 2;
-		} else if ((stb.st_mode & S_IFMT) != S_IFBLK) {
-			dprintf("%s is not a block device.\n",
-				devname);
+		} else if (check_blkdev_via_fstat(dfd, devname) &&
+			   fstat(dfd, &stb)) {
 			tmpdev->used = 2;
 		} else if (must_be_container(dfd)) {
 			struct supertype *cst;
diff --git a/util.c b/util.c
index 03d4256..09750a0 100644
--- a/util.c
+++ b/util.c
@@ -789,6 +789,21 @@ int check_blkdev_via_stat(char *dev)
 	return 0;
 }
 
+int check_blkdev_via_fstat(int fd, char *dev)
+{
+	struct stat stb;
+
+	if (fstat(fd, &stb) != 0) {
+		pr_err("fstat failed for %s: %s\n", dev, strerror(errno));
+		return -1;
+	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return -1;
+	}
+	return 0;
+}
+
 int is_standard(char *dev, int *nump)
 {
 	/* tests if dev is a "standard" md dev name.
-- 
2.10.2


^ permalink raw reply related

* [PATCH 3/4] mdadm/Create:declaring an existing struct within same function
From: Zhilong Liu @ 2017-04-01 12:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170401125145.14440-1-zlliu@suse.com>

Create:declaring 'struct stat stb' twice within the same
function, rename stb as stb2 when declares 'struct stat'
at the second time.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Create.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Create.c b/Create.c
index e43c1d6..619e68a 100644
--- a/Create.c
+++ b/Create.c
@@ -865,7 +865,7 @@ int Create(struct supertype *st, char *mddev,
 		for (dnum=0, raid_disk_num=0, dv = devlist ; dv ;
 		     dv=(dv->next)?(dv->next):moved_disk, dnum++) {
 			int fd;
-			struct stat stb;
+			struct stat stb2;
 			struct mdinfo *inf = &infos[dnum];
 
 			if (dnum >= total_slots)
@@ -921,9 +921,9 @@ int Create(struct supertype *st, char *mddev,
 							dv->devname);
 						goto abort_locked;
 					}
-					fstat(fd, &stb);
-					inf->disk.major = major(stb.st_rdev);
-					inf->disk.minor = minor(stb.st_rdev);
+					fstat(fd, &stb2);
+					inf->disk.major = major(stb2.st_rdev);
+					inf->disk.minor = minor(stb2.st_rdev);
 				}
 				if (fd >= 0)
 					remove_partitions(fd);
@@ -944,8 +944,8 @@ int Create(struct supertype *st, char *mddev,
 
 				if (!have_container) {
 					/* getinfo_super might have lost these ... */
-					inf->disk.major = major(stb.st_rdev);
-					inf->disk.minor = minor(stb.st_rdev);
+					inf->disk.major = major(stb2.st_rdev);
+					inf->disk.minor = minor(stb2.st_rdev);
 				}
 				break;
 			case 2:
-- 
2.10.2


^ permalink raw reply related

* [PATCH 4/4] mdadm/Monitor:check the block device when use waitclean parameter
From: Zhilong Liu @ 2017-04-01 12:51 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170401125145.14440-1-zlliu@suse.com>

WaitClean(): stat2devnm() returns NULL for non block devices.
Check the pointer is valid derefencing it. This would happen
when using --waitclean, such as the 'f' and 'd' file type,
causing a core dump.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Monitor.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Monitor.c b/Monitor.c
index 036a561..101b765 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1076,6 +1076,8 @@ int WaitClean(char *dev, int sock, int verbose)
 			pr_err("Couldn't open %s: %s\n", dev, strerror(errno));
 		return 1;
 	}
+	if (check_blkdev_via_fstat(fd, dev))
+		return 2;
 
 	strcpy(devnm, fd2devnm(fd));
 	mdi = sysfs_read(fd, devnm, GET_VERSION|GET_LEVEL|GET_SAFEMODE);
-- 
2.10.2


^ permalink raw reply related

* Re: [PATCH] Revert "md: raid1: use bio helper in process_checks()"
From: Henrique de Moraes Holschuh @ 2017-04-01 21:51 UTC (permalink / raw)
  To: Wols Lists
  Cc: Ming Lei, Arnd Bergmann, Shaohua Li, NeilBrown, Jens Axboe,
	colyli@suse.de, Guoqing Jiang, Mike Christie,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Linux Kernel Mailing List
In-Reply-To: <58DA848B.7030308@youngman.org.uk>

On Tue, 28 Mar 2017, Wols Lists wrote:
> What Arnd is doing is commonly called "defensive programming", and
> unfortunately reality shows us that it is usually worth its weight in
> gold. That's why you put ASSERTs in code - so that if somebody does
> something stupid by accident, it blows up. This is just more of the same.

You know, BUG_ON(vcnt1 != vcnt2) might address this quite nicely [if you
want to go the assert() way, that is!], since __attribute__((noreturn))
is set for the codepath taken when the BUG_ON condition triggers...

That said, if there is something less insane to be done than OOPsing
when the world has gone strange and vcnt1 != vcnt2, that ought to be a
better solution...

All of this assumes vcnt1 == vcnt2 is really the only possibily correct
situation.

-- 
  Henrique Holschuh

^ permalink raw reply

* Re: RAID6 rebuild oddity
From: NeilBrown @ 2017-04-03  2:09 UTC (permalink / raw)
  To: Dan Williams; +Cc: Brad Campbell, Linux-RAID
In-Reply-To: <CAPcyv4gwG6yRMKF4wdhT_cPfCn=eucLX=ON7DWTA2gw0ptgmbQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9445 bytes --]

On Fri, Mar 31 2017, Dan Williams wrote:

> On Wed, Mar 29, 2017 at 5:49 PM, NeilBrown <neilb@suse.com> wrote:
>> On Wed, Mar 29 2017, Brad Campbell wrote:
>>
>>> On 29/03/17 12:08, NeilBrown wrote:
>>>
>>>>
>>>> sdj is getting twice the utilization of the others but only about 10%
>>>> more rKB/sec.  That suggests lots of seeking.
>>>
>>> Yes, something is not entirely sequential.
>>>
>>>> Does "fuser /dev/sdj" report anything funny?
>>>
>>> No. No output. As far as I can tell nothing should be touching the disks
>>> other than the md kernel thread.
>>>
>>>> Is there filesystem IO happening? If so, what filesystem?
>>>> Have you told the filesystem about the RAID layout?
>>>> Maybe the filesystem keeps reading some index blocks that are always on
>>>> the same drive.
>>>
>>> No. I probably wasn't as clear as I should have been in the initial
>>> post. There was nothing mounted at the time.
>>>
>>> Right now the array contains one large LUKS container (dm-crypt). This
>>> was mounted and a continuous dd done to the dm device to zero it out :
>>>
>>> 4111195+1 records in
>>> 4111195+1 records out
>>> 34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
>>>
>>> So there is no filesystem on the drive.
>>>
>>> I failed and removed sdi :
>>>
>>> root@test:~# mdadm --fail /dev/md0 /dev/sdi
>>> mdadm: set /dev/sdi faulty in /dev/md0
>>> root@test:~# mdadm --remove /dev/md0 /dev/sdi
>>> mdadm: hot removed /dev/sdi from /dev/md0
>>> root@test:~# mdadm --zero-superblock /dev/sdi
>>> root@test:~# mdadm --add /dev/md0 /dev/sdi
>>> mdadm: added /dev/sdi
>>>
>>> Rebooted the machine to remove all tweaks of things like stripe cache
>>> size, readahead, NCQ and anything else.
>>>
>>> I opened the LUKS container, dd'd a meg to the start to write to the
>>> array and kick off the resync, then closed the LUKS container. At this
>>> point dm should no longer be touching the drive and I've verified the
>>> device has gone.
>>>
>>> I then ran sync a couple of times and waited a couple of minutes until I
>>> was positive _nothing_ was touching md0, then ran :
>>>
>>> blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
>>>
>>>> So the problem moves from drive to drive?  Strongly suggests filesystem
>>>> metadata access to me.
>>>
>>> Again, sorry for me not being clear. The situation changes on a resync
>>> specific basis. For example the reproduction I've done now I popped out
>>> sdi rather than sdb, and now the bottleneck is sdg. It is the same if
>>> the exact circumstances remain the same.
>>
>> Now *that* is interesting!
>>
>> In the first example you gave, device 0 was rebuilding and device 7 was
>> slow.
>> Now device 6 is rebuilding and device5 is slow.  Surely you see the
>> pattern?
>>
>> Also, I previously observed that the slow device was getting 10% more
>> read traffic, but I was being imprecise.
>> In the first example  it was 16.2%, in the second it is 14.5.
>> These are close to 1/7.
>>
>> There are 8 devices in the array, so there are 8 different layouts for
>> stripes (the parity block can be on 8 different devices).  8 is close to
>> 7.
>>
>> How recovery *should* work on RAID6 is that for each strip we read all
>> blocks except for the Q block (which isn't needed to rebuild a single
>> device) and the block which has failed (of course).  For the stripe
>> where the Q block has failed, we don't read the P block.  Just read all
>> the data and calculate Q from that.
>>
>> So for every 8-device strip, we read from 6 devices.  5 Data plus parity
>> when data has failed, 6 data when P or Q has failed.
>> So across each of the 8 different strip layouts, there are 48 reads
>> Spread across 7 working devices....  using lower-case for blocks that
>> aren't read and assuming first device is being recovered:
>>
>>            d  D  D  D  D  D  P  q
>>            d  D  D  D  D  P  q  D
>>            d  D  D  D  P  q  D  D
>>            d  D  D  P  q  D  D  D
>>            d  D  P  q  D  D  D  D
>>            d  P  q  D  D  D  D  D
>>            p  q  D  D  D  D  D  D
>>            q  D  D  D  D  D  D  p
>> totals:    0  7  7  7  7  7  7  6
>>
>> The device just "before" the missing device gets fewer reads,
>> because we don't need to read P from that device.
>> But we are seeing that it gets more reads.  The numbers suggest that
>> it gets 8 reads (1/7 more than the others).  But the numbers also
>> suggest that it gets lots of seeks.  Maybe after all the other devices
>> have break read for one stripe, md/raid6 decides it does need those
>> blocks and goes back to read them?  That would explain the seeking
>> and the increased number of reads.
>>
>> Time to look at the blktrace...
>> Looking at sdj, each request is Queued 8 times (I wonder why) but the
>> requests are completely sequential.
>>
>>   blkparse sdj* | awk '$6 == "Q" && $8 != prev {print $8, $8-prev ; prev=$8}'  | awk '$2 != 8'
>>
>> This means the 'Q' block is being read.  Not what I expected, but not
>> too surprising.
>>
>> Looking at sdg, which is the problem device:
>>
>>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {print $8-prev ; prev=$8}'  \
>>     | sort | uniq -c | sort -n | tail -60
>>
>> There are lots of places we skip forward 904 sectors.  That is an 8
>> sector read and a 896 sector seek. 896 sectors = 448K or 7 chunks
>> (64K chunk size).
>> These 7-chunk seeks are separated by 16 8-sector reads.  So it seems to
>> be reading all the P (or maybe Q) blocks from a sequence of stripes.
>>
>> There are also lots of large back/forward seeks. They range from -31328
>> to -71280 backward and 27584 to 66408  (with a couple of smaller ones).
>>
>> If we ignore all reads to sdg that are the result of seeking backwards -
>> i.e. that are earlier that the largest offset seen so far:
>>
>>   blkparse sdg* | awk '$6 == "Q" && $8 != prev {if ($8>max) {print $8; max=$8} ; prev=$8}' | awk '{print $1-prev; prev=$1}' | grep  -v 8
>>
>> we seek that (after an initial section) every block is being read.
>> So it seems that sdg is seeking backwards to read some blocks that it
>> has already read.
>>
>> So what really happens in that all the working devices read all blocks
>> (although they don't really need Q), and sdg needs to seek backwards to
>> re-read 1/8 of all blocks, probably the P block.
>>
>> So we should be seeing the rkB/s for the slow disk 1/8 more than the
>> others.  i.e. 12.5%, not 14% or 16%.  The difference bothers me, but not
>> very much.
>>
>> Now to look at the code.
>>
>> need_this_block() always returns 1 when s->syncing, so that is why
>> we already read the Q block.  The distinction between recovery and
>> resync isn't very strong in raid5/6.
>>
>> So on the stripes where Q has failed, we read all the Data and P.
>> Then handle_parity_checks6() notice that there are enough devices that
>> something is worth checking, so it checks the P.  This is destructive!
>> of P.  The D blocks are xored into P and P is checked for all-zero.
>>
>> I always get lost following this next bit....
>> I think that after deciding zero_sum_result is zero, check_state is set
>> to check_state_compute_result.
>> But before that is processed, we go around the loop again and notice
>> that P is missing (because handle_parity_checks6() cleared R5_UPTODATE)
>> so need_this_block() decided that we need to read it back in again.
>> We don't calculate it because we only calculate blocks on failed
>> devices (I'm sure we didn't used to do that).
>> So we read P back in ... just as we see from the trace.
>>
>> Patch below makes the symptom go away in my testing, which confirms
>> that I'm on the right track.
>>
>> Dan Williams changed the code to only recompute blocks from failed
>> devices, way back in 2008.
>> Commit: c337869d9501 ("md: do not compute parity unless it is on a failed drive")
>>
>> Back then we didn't have raid6 in the same code, so the fix was probably
>> fine.
>>
>> I wonder what the best fix is.... Dan... Any thoughts?
>
> First thought is "Wow!" that's a great bit of detective work.

Not many problems allow for that sort of detective work... It is fun
when they come along :-)

>
> Second thought is that since we validated P as being in-sync maybe we
> can flag at is safe for compute and not re-read it. I wish I had
> written a unit test for the condition where we were missing some
> writebacks after computing that lead to commit c337869d9501  ("md: do
> not compute parity unless it is on a failed drive").
>
> However, I think we already have enough information to trust P.
>
>                         /* The only possible failed device holds Q, so it
>                          * makes sense to check P (If anything else were failed,
>                          * we would have used P to recreate it).
>                          */
>
> So I think your patch is actually already correct, would just need
> comment like the following so the exception is clear:
>
>                 /*
>                  * In the raid6 case if the only non-uptodate disk is P
>                  * then we already trusted P to compute the other failed
>                  * drives. It is safe to compute rather than re-read P.
>                  */
>
> Thoughts?

I'm convinced.  If we have all of D and Q, we *must* have read (or
generated) P recently, so we cannot hide an error here.
I extended the comment to explain why we don't always compute missing
blocks.
I've think your response qualifies for a Reviewed-by - hope that's OK.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH] md/raid6: Fix anomily when recovering a single device in RAID6.
From: NeilBrown @ 2017-04-03  2:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Brad Campbell, Linux-RAID, Dan Williams

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

When recoverying a single missing/failed device in a RAID6,
those stripes where the Q block is on the missing device are
handled a bit differently.  In these cases it is easy to
check that the P block is correct, so we do.  This results
in the P block be destroy.  Consequently the P block needs
to be read a second time in order to compute Q.  This causes
lots of seeks and hurts performance.

It shouldn't be necessary to re-read P as it can be computed
from the DATA.  But we only compute blocks on missing
devices, since c337869d9501 ("md: do not compute parity
unless it is on a failed drive").

So relax the change made in that commit to allow computing
of the P block in a RAID6 which it is the only missing that
block.

This makes RAID6 recovery run much faster as the disk just
"before" the recovering device is no longer seeking
back-and-forth.

Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c523fd69a7bc..aeb2e236a247 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3617,9 +3617,20 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
 		BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
 		BUG_ON(test_bit(R5_Wantread, &dev->flags));
 		BUG_ON(sh->batch_head);
+
+		/*
+		 * In the raid6 case if the only non-uptodate disk is P
+		 * then we already trusted P to compute the other failed
+		 * drives. It is safe to compute rather than re-read P.
+		 * In other cases we only compute blocks from failed
+		 * devices, otherwise check/repair might fail to detect
+		 * a real inconsistency.
+		 */
+
 		if ((s->uptodate == disks - 1) &&
+		    ((sh->qd_idx >= 0 && sh->pd_idx == disk_idx) ||
 		    (s->failed && (disk_idx == s->failed_num[0] ||
-				   disk_idx == s->failed_num[1]))) {
+				   disk_idx == s->failed_num[1])))) {
 			/* have disk failed, and we're requested to fetch it;
 			 * do compute it
 			 */
-- 
2.12.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: [PATCH] md/raid10: remove unnecessary checks with 'first'
From: NeilBrown @ 2017-04-03  2:21 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid; +Cc: shli
In-Reply-To: <1491017900-8150-1-git-send-email-gqjiang@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1890 bytes --]

On Sat, Apr 01 2017, Guoqing Jiang wrote:

> Since the value of first is always '1', we can
> set min_offset_diff directly.

Alternately, maybe the fact that first is always '1', is a sign of a bug
that should be fixed.

The calculation of "min_offset_diff" doesn't make much sense now.  It is
just the last_offset_diff.

I think the correct fix is to set "first = 0;" at the end of the
rdev_for_each() loop.

Thanks for spotting this.

NeilBrown

>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid10.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0f13d57..edaf8f4 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3702,7 +3702,6 @@ static int raid10_run(struct mddev *mddev)
>  	struct md_rdev *rdev;
>  	sector_t size;
>  	sector_t min_offset_diff = 0;
> -	int first = 1;
>  	bool discard_supported = false;
>  
>  	if (mddev->private == NULL) {
> @@ -3758,8 +3757,7 @@ static int raid10_run(struct mddev *mddev)
>  			diff = -diff;
>  		if (diff < 0)
>  			diff = 0;
> -		if (first || diff < min_offset_diff)
> -			min_offset_diff = diff;
> +		min_offset_diff = diff;
>  
>  		if (mddev->gendisk)
>  			disk_stack_limits(mddev->gendisk, rdev->bdev,
> @@ -4140,7 +4138,6 @@ static int raid10_start_reshape(struct mddev *mddev)
>  
>  	unsigned long before_length, after_length;
>  	sector_t min_offset_diff = 0;
> -	int first = 1;
>  	struct geom new;
>  	struct r10conf *conf = mddev->private;
>  	struct md_rdev *rdev;
> @@ -4169,8 +4166,7 @@ static int raid10_start_reshape(struct mddev *mddev)
>  				diff = -diff;
>  			if (diff < 0)
>  				diff = 0;
> -			if (first || diff < min_offset_diff)
> -				min_offset_diff = diff;
> +			min_offset_diff = diff;
>  		}
>  	}
>  
> -- 
> 2.6.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v1] mdadm/grow: reshape would be stuck from raid1 to raid5
From: NeilBrown @ 2017-04-03  4:36 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <20170330073808.6176-1-zlliu@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

On Thu, Mar 30 2017, Zhilong Liu wrote:

> systemctl doesn't interpret mdadm-grow-continue@.service
> correctly due to the wrong argument provided in [service],
> it should be corrected %I as %i. Otherwise, if the service
> cannot start by systemctl and the reshap progress would be
> stuck all time when grows array from raid1 to raid5.
>
> reproduce steps:
> ./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
> ./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  systemd/mdadm-grow-continue@.service | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/systemd/mdadm-grow-continue@.service b/systemd/mdadm-grow-continue@.service
> index 5c667d2..882bc0b 100644
> --- a/systemd/mdadm-grow-continue@.service
> +++ b/systemd/mdadm-grow-continue@.service
> @@ -10,7 +10,7 @@ Description=Manage MD Reshape on /dev/%I
>  DefaultDependencies=no
>  
>  [Service]
> -ExecStart=BINDIR/mdadm --grow --continue /dev/%I
> +ExecStart=BINDIR/mdadm --grow --continue /dev/%i

Do you know why this makes a difference?  I don't think it should.
man systemd.unit says that "%i" is the "Instance name" while "%I" is the
"Unescaped instance name".

The Instance name here is something like "md0" so there is nothing to
escape.

I would rather not change it unless we know exactly why it is broken,
and I don't find your explanation to be convincing.

NeilBrown


>  StandardInput=null
>  StandardOutput=null
>  StandardError=null
> -- 
> 2.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Hannes Reinecke @ 2017-04-03  6:06 UTC (permalink / raw)
  To: Christoph Hellwig, axboe, martin.petersen, agk, snitzer, shli,
	philipp.reisner, lars.ellenberg
  Cc: linux-block, linux-scsi, drbd-dev, dm-devel, linux-raid
In-Reply-To: <20170331163313.31821-1-hch@lst.de>

On 03/31/2017 06:32 PM, Christoph Hellwig wrote:
> This series makes REQ_OP_WRITE_ZEROES the only zeroing offload
> supported by the block layer, and switches existing implementations
> of REQ_OP_DISCARD that correctly set discard_zeroes_data to it,
> removes incorrect discard_zeroes_data, and also switches WRITE SAME
> based zeroing in SCSI to this new method.
> 
> The series is against the block for-next tree.
> 
> A git tree is also avaiable at:
> 
>     git://git.infradead.org/users/hch/block.git discard-rework
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/discard-rework
Thank you for doing this.

For this series:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
From: Daniel Axtens @ 2017-04-03 11:37 UTC (permalink / raw)
  To: Matt Brown, linux-raid; +Cc: linuxppc-dev
In-Reply-To: <20170330051313.31358-1-matthew.brown.dev@gmail.com>

Hi Matt,

> The raid6 Q syndrome check has been optimised using the vpermxor
> instruction.  This instruction was made available with POWER8, ISA version
> 2.07. It allows for both vperm and vxor instructions to be done in a single
> instruction. This has been tested for correctness on a ppc64le vm with a
> basic RAID6 setup containing 5 drives.
>
> The performance benchmarks are from the raid6test in the /lib/raid6/test
> directory. These results are from an IBM Firestone machine with ppc64le
> architecture. The benchmark results show a 35% speed increase over the best
> existing algorithm for powerpc (altivec). The raid6test has also been run
> on a big-endian ppc64 vm to ensure it also works for big-endian
> architectures.
>
> Performance benchmarks:
>
> 	raid6: altivecx4 gen() 18773 MB/s
> 	raid6: altivecx8 gen() 19438 MB/s
>
> 	raid6: vpermxor4 gen() 25112 MB/s
> 	raid6: vpermxor8 gen() 26279 MB/s
>
Well done!

> Note: Also fixed a small bug in pq.h regarding a missing and mismatched
> ifdef statement

So this could be slightly better explained:

4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")
breaks raid6test on ppc by rearranging the ifdefs incorrectly.

You could possibly split that fix out into its own separate commit, but
that is probably overkill.

You should, however, probably include just before your Signed-off-by:
Fixes: 4f8c55c5ad49 ("lib/raid6: build proper files on corresponding arch")

> Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
> ---
>  include/linux/raid/pq.h |   4 ++
>  lib/raid6/Makefile      |  27 ++++++++++++-
>  lib/raid6/algos.c       |   4 ++
>  lib/raid6/altivec.uc    |   3 ++
>  lib/raid6/test/Makefile |  28 ++++++++++++-
>  lib/raid6/vpermxor.uc   | 102 ++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 165 insertions(+), 3 deletions(-)
>  create mode 100644 lib/raid6/vpermxor.uc
>
> diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
> index 4d57bba..3df9aa6 100644
> --- a/include/linux/raid/pq.h
> +++ b/include/linux/raid/pq.h
> @@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
>  extern const struct raid6_calls raid6_avx512x4;
>  extern const struct raid6_calls raid6_tilegx8;
>  extern const struct raid6_calls raid6_s390vx8;
> +extern const struct raid6_calls raid6_vpermxor1;
> +extern const struct raid6_calls raid6_vpermxor2;
> +extern const struct raid6_calls raid6_vpermxor4;
> +extern const struct raid6_calls raid6_vpermxor8;
>  
>  struct raid6_recov_calls {
>  	void (*data2)(int, size_t, int, int, void **);
> diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
> index 3057011..7775aad 100644
> --- a/lib/raid6/Makefile
> +++ b/lib/raid6/Makefile
> @@ -4,7 +4,8 @@ raid6_pq-y	+= algos.o recov.o tables.o int1.o int2.o int4.o \
>  		   int8.o int16.o int32.o
>  
>  raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
> -raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
> +raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
> +				vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
>  raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
>  raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
>  raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
> @@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
>  $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
>  	$(call if_changed,unroll)
>  
> +CFLAGS_vpermxor1.o += $(altivec_flags)
> +targets += vpermxor1.c
> +$(obj)/vpermxor1.c: UNROLL := 1
> +$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_vpermxor2.o += $(altivec_flags)
> +targets += vpermxor2.c
> +$(obj)/vpermxor2.c: UNROLL := 2
> +$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_vpermxor4.o += $(altivec_flags)
> +targets += vpermxor4.c
> +$(obj)/vpermxor4.c: UNROLL := 4
> +$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
> +CFLAGS_vpermxor8.o += $(altivec_flags)
> +targets += vpermxor8.c
> +$(obj)/vpermxor8.c: UNROLL := 8
> +$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
> +	$(call if_changed,unroll)
> +
>  CFLAGS_neon1.o += $(NEON_FLAGS)
>  targets += neon1.c
>  $(obj)/neon1.c:   UNROLL := 1
> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
> index 7857049..edd4f69 100644
> --- a/lib/raid6/algos.c
> +++ b/lib/raid6/algos.c
> @@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
>  	&raid6_altivec2,
>  	&raid6_altivec4,
>  	&raid6_altivec8,
> +	&raid6_vpermxor1,
> +	&raid6_vpermxor2,
> +	&raid6_vpermxor4,
> +	&raid6_vpermxor8,
>  #endif
>  #if defined(CONFIG_TILEGX)
>  	&raid6_tilegx8,
> diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
> index 682aae8..d20ed0d 100644
> --- a/lib/raid6/altivec.uc
> +++ b/lib/raid6/altivec.uc
> @@ -24,10 +24,13 @@
>  
>  #include <linux/raid/pq.h>
>  
> +#ifdef CONFIG_ALTIVEC
> +
>  #include <altivec.h>
>  #ifdef __KERNEL__
>  # include <asm/cputable.h>
>  # include <asm/switch_to.h>
> +#endif /* __KERNEL__ */
>  
>  /*
>   * This is the C data type to use.  We use a vector of
> diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
> index 2c7b60e..29ebb39 100644
> --- a/lib/raid6/test/Makefile
> +++ b/lib/raid6/test/Makefile
> @@ -47,13 +47,25 @@ else
>                           gcc -c -x c - >&/dev/null && \
>                           rm ./-.o && echo yes)
>          ifeq ($(HAS_ALTIVEC),yes)
> -                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
> +		OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
Your editor seems to have replaced the spaces with tabs here - you could
drop this hunk entirely.

>          endif
>  endif
>  ifeq ($(ARCH),tilegx)
>  OBJS += tilegx8.o
>  endif
>  
> +ifeq ($(ARCH),ppc64le)
> +	CFLAGS += -I../../../arch/powerpc/include
> +	CFLAGS += -DCONFIG_ALTIVEC
> +	OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
> +		vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> +else ifeq ($(ARCH),ppc64)
> +	CFLAGS += -I../../../arch/powerpc/include
> +	CFLAGS += -DCONFIG_ALTIVEC
> +	OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
> +		vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
> +endif

These two test bodies are the same, right? Could you replace this all
with a test for HAS_ALTIVEC?

> +
>  .c.o:
>  	$(CC) $(CFLAGS) -c -o $@ $<
>  
> @@ -97,6 +109,18 @@ altivec4.c: altivec.uc ../unroll.awk
>  altivec8.c: altivec.uc ../unroll.awk
>  	$(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
>  
> +vpermxor1.c: vpermxor.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=1 < vpermxor.uc > $@
> +
> +vpermxor2.c: vpermxor.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=2 < vpermxor.uc > $@
> +
> +vpermxor4.c: vpermxor.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=4 < vpermxor.uc > $@
> +
> +vpermxor8.c: vpermxor.uc ../unroll.awk
> +	$(AWK) ../unroll.awk -vN=8 < vpermxor.uc > $@
> +
>  int1.c: int.uc ../unroll.awk
>  	$(AWK) ../unroll.awk -vN=1 < int.uc > $@
>  
> @@ -122,7 +146,7 @@ tables.c: mktables
>  	./mktables > tables.c
>  
>  clean:
> -	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c neon*.c tables.c raid6test
> +	rm -f *.o *.a mktables mktables.c *.uc int*.c altivec*.c vpermxor*.c neon*.c tables.c raid6test
>  	rm -f tilegx*.c
>  
>  spotless: clean
> diff --git a/lib/raid6/vpermxor.uc b/lib/raid6/vpermxor.uc
> new file mode 100644
> index 0000000..96f48e3
> --- /dev/null
> +++ b/lib/raid6/vpermxor.uc
> @@ -0,0 +1,102 @@
> +/*
> + * Copyright 2017, Matt Brown, IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + *
> + * vpermxor$#.c
> + *
> + * $#-way unrolled portable integer math RAID-6 instruction set
> + * This file is postprocessed using unroll.awk
> + *
> + * vpermxor$#.c makes use of the vpermxor opcode to optimise the RAID6 Q
> + * syndrome calculations.
> + * This can be run on systems which have both Altivec and the vpermxor opcode.
> + *
> + * This instruction was introduced in POWER8 - ISA v2.07.
> + */
> +
> +#include <linux/raid/pq.h>
> +#ifdef CONFIG_ALTIVEC
> +
> +#include <altivec.h>
> +#ifdef __KERNEL__
> +#include <asm/cputable.h>
> +#include <asm/switch_to.h>
> +#endif
> +
> +typedef vector unsigned char unative_t;
> +#define NSIZE sizeof(unative_t)
> +

A comment about how these are generated wouldn't go astray. Even if it's
just a link to H. Peter Anvin's paper :)

> +static const vector unsigned char gf_low = {0x1e, 0x1c, 0x1a, 0x18, 0x16, 0x14,
> +					    0x12, 0x10, 0x0e, 0x0c, 0x0a, 0x08,
> +					    0x06, 0x04, 0x02,0x00};
> +static const vector unsigned char gf_high = {0xfd, 0xdd, 0xbd, 0x9d, 0x7d, 0x5d,
> +					     0x3d, 0x1d, 0xe0, 0xc0, 0xa0, 0x80,
> +					     0x60, 0x40, 0x20, 0x00};
> +
> +static void noinline raid6_vpermxor$#_gen_syndrome_real(int disks, size_t bytes,
> +							void **ptrs)
> +{
> +	u8 **dptr = (u8 **)ptrs;
> +	u8 *p, *q;
> +	int d, z, z0;
> +	unative_t wp$$, wq$$, wd$$;
> +
> +	z0 = disks - 3;		/* Highest data disk */
> +	p = dptr[z0+1];		/* XOR parity */
> +	q = dptr[z0+2];		/* RS syndrome */
> +
> +	for (d = 0; d < bytes; d += NSIZE*$#) {
> +		wp$$ = wq$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE];
> +
> +		for (z = z0-1; z>=0; z--) {
> +			wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE];
> +			/* P syndrome */
> +			wp$$ = vec_xor(wp$$, wd$$);
> +
> +			/*Q syndrome */
> +			__asm__ __volatile__("vpermxor %0,%1,%2,%3\n\t":"=v"(wq$$):"v"(gf_high), "v"(gf_low), "v"(wq$$));

A couple of things:
 - does this need to be marked volatile?

 - I don't think this needs a \n\t at the end of the line

 - It's a very long line

 - I think there should be spaces around the colons but I'm not 100%
   clear on the coding style here.

> +			wq$$ = vec_xor(wq$$, wd$$);
> +		}
> +		*(unative_t *)&p[d+NSIZE*$$] = wp$$;
> +		*(unative_t *)&q[d+NSIZE*$$] = wq$$;
> +	}
> +}
> +
> +static void raid6_vpermxor$#_gen_syndrome(int disks, size_t bytes, void **ptrs)
> +{
> +	preempt_disable();
> +	enable_kernel_altivec();
> +
> +	raid6_vpermxor$#_gen_syndrome_real(disks, bytes, ptrs);
> +
> +	disable_kernel_altivec();
> +	preempt_enable();
> +}

So there's a similar sort of function in
arch/powerpc/crypto/crc32c-vpmsum_glue.c - see around line 35.

In that function, the flow is:
 pagefault_disable();
 enable_kernel_altivec();
 <vectorised function>
 pagefault_enable();

There are a few things that it would be nice (but by no means essential)
to find out:
 - what is the difference between pagefault and prempt enable/disable
 - is it required to disable altivec after the end of the function or
   can we leave that enabled?

> +
> +int raid6_have_altivec_vpermxor(void);
> +#if $# == 1
> +int raid6_have_altivec_vpermxor(void)
> +{
> +	/* Check if CPU has both altivec and the vpermxor instruction*/
Please add a space: s|ion*/|ion */|
> +# ifdef __KERNEL__
> +	return (cpu_has_feature(CONFIG_ALTIVEC) &&
> +		cpu_has_feature(CPU_FTR_ARCH_207S));
I assume this is future-proof - an ISA 3.00 cpu will advertise 2.07S
compat?

> +# else
> +	return 1;
> +#endif
> +
> +}
> +#endif
> +
> +const struct raid6_calls raid6_vpermxor$# = {
> +	raid6_vpermxor$#_gen_syndrome,
> +	NULL,
> +	raid6_have_altivec_vpermxor,
> +	"vpermxor$#",
> +	0
> +};
> +#endif
> -- 
> 2.9.3

Apart from that this patch looks good and I expect I will be able to
formally Review v2.

Regards,
Daniel

^ permalink raw reply

* [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-03 12:05 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, NeilBrown, Jinpu Wang


blk_attempt_plug_merge() try to merge bio into request and chain them
by 'bi_next', while after the bio is done inside request, we forgot to
reset the 'bi_next'.

This lead into BUG while removing all the underlying devices from md-raid1,
the bio once go through:

  md_do_sync()
    sync_request()
      generic_make_request()
        blk_queue_bio()
          blk_attempt_plug_merge()
            CHAINED HERE

will keep chained and reused by:

  raid1d()
    sync_request_write()
      generic_make_request()
        BUG_ON(bio->bi_next)

After reset the 'bi_next' this can no longer happen.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 block/blk-core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43b7d06..91223b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
                struct bio *bio = req->bio;
                unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);

-               if (bio_bytes == bio->bi_iter.bi_size)
+               if (bio_bytes == bio->bi_iter.bi_size) {
                        req->bio = bio->bi_next;
+                       bio->bi_next = NULL;
+               }

                req_bio_endio(req, bio, bio_bytes, error);

-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH] dm ioctl: Remove double parentheses
From: Matthias Kaehlcke @ 2017-04-03 16:56 UTC (permalink / raw)
  To: Joe Perches
  Cc: Alasdair Kergon, Mike Snitzer, dm-devel, Shaohua Li, linux-raid,
	linux-kernel, Grant Grundler, Michael Davidson, Greg Hackmann
In-Reply-To: <1491012453.27353.25.camel@perches.com>

Hi Joe,

El Fri, Mar 31, 2017 at 07:07:33PM -0700 Joe Perches ha dit:

> On Fri, 2017-03-31 at 18:50 -0700, Matthias Kaehlcke wrote:
> > El Thu, Mar 16, 2017 at 09:48:30AM -0700 Matthias Kaehlcke ha dit:
> > 
> > > The extra pair of parantheses is not needed and causes clang to generate
> > > the following warning:
> > > 
> > > drivers/md/dm-ioctl.c:1776:11: error: equality comparison with extraneous parentheses [-Werror,-Wparentheses-equality]
> > >         if ((cmd == DM_DEV_CREATE_CMD)) {
> > >              ~~~~^~~~~~~~~~~~~~~~~~~~
> > > drivers/md/dm-ioctl.c:1776:11: note: remove extraneous parentheses around the comparison to silence this warning
> > >         if ((cmd == DM_DEV_CREATE_CMD)) {
> > >             ~    ^                   ~
> > > drivers/md/dm-ioctl.c:1776:11: note: use '=' to turn this equality comparison into an assignment
> > >         if ((cmd == DM_DEV_CREATE_CMD)) {
> 
> There are dozens of these comparisons in the kernel.
> Are you fixing them all or just this one?

For now I focus on occurrences that pop up in my builds. So far there
have been two more:

https://patchwork.kernel.org/patch/9626833/
https://patchwork.kernel.org/patch/9631591/

Thanks

Matthias

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-03 21:25 UTC (permalink / raw)
  To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
	linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <9505ff12-7307-7dec-76b5-2a233a592634@profitbricks.com>

[-- Attachment #1: Type: text/plain, Size: 1927 bytes --]

On Mon, Apr 03 2017, Michael Wang wrote:

> blk_attempt_plug_merge() try to merge bio into request and chain them
> by 'bi_next', while after the bio is done inside request, we forgot to
> reset the 'bi_next'.
>
> This lead into BUG while removing all the underlying devices from md-raid1,
> the bio once go through:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()

This is a read request from the "first" device.

>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             CHAINED HERE
>
> will keep chained and reused by:
>
>   raid1d()
>     sync_request_write()
>       generic_make_request()

This is a write request to some other device, isn't it?

If sync_request_write() is using a bio that has already been used, it
should call bio_reset() and fill in the details again.
However I don't see how that would happen.
Can you give specific details on the situation that triggers the bug?

Thanks,
NeilBrown


>         BUG_ON(bio->bi_next)
>
> After reset the 'bi_next' this can no longer happen.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  block/blk-core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43b7d06..91223b2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>                 struct bio *bio = req->bio;
>                 unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>
> -               if (bio_bytes == bio->bi_iter.bi_size)
> +               if (bio_bytes == bio->bi_iter.bi_size) {
>                         req->bio = bio->bi_next;
> +                       bio->bi_next = NULL;
> +               }
>
>                 req_bio_endio(req, bio, bio_bytes, error);
>
> -- 
> 2.5.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] raid6/altivec: adding vpermxor implementation for raid6 Q syndrome
From: Daniel Axtens @ 2017-04-03 21:44 UTC (permalink / raw)
  To: Matt Brown, linux-raid; +Cc: linuxppc-dev
In-Reply-To: <877f31n2mn.fsf@possimpible.ozlabs.ibm.com>

> In that function, the flow is:
>  pagefault_disable();
>  enable_kernel_altivec();
>  <vectorised function>
>  pagefault_enable();
>
> There are a few things that it would be nice (but by no means essential)
> to find out:
>  - what is the difference between pagefault and prempt enable/disable
>  - is it required to disable altivec after the end of the function or
>    can we leave that enabled?

Answering my own question here, dc4fbba11e46 ("powerpc: Create
disable_kernel_{fp,altivec,vsx,spe}()") adds the disable_ function, and
it's a no-op except under debug conditions. So it should stay.

Regards,
Daniel


>
>> +
>> +int raid6_have_altivec_vpermxor(void);
>> +#if $# == 1
>> +int raid6_have_altivec_vpermxor(void)
>> +{
>> +	/* Check if CPU has both altivec and the vpermxor instruction*/
> Please add a space: s|ion*/|ion */|
>> +# ifdef __KERNEL__
>> +	return (cpu_has_feature(CONFIG_ALTIVEC) &&
>> +		cpu_has_feature(CPU_FTR_ARCH_207S));
> I assume this is future-proof - an ISA 3.00 cpu will advertise 2.07S
> compat?
>
>> +# else
>> +	return 1;
>> +#endif
>> +
>> +}
>> +#endif
>> +
>> +const struct raid6_calls raid6_vpermxor$# = {
>> +	raid6_vpermxor$#_gen_syndrome,
>> +	NULL,
>> +	raid6_have_altivec_vpermxor,
>> +	"vpermxor$#",
>> +	0
>> +};
>> +#endif
>> -- 
>> 2.9.3
>
> Apart from that this patch looks good and I expect I will be able to
> formally Review v2.
>
> Regards,
> Daniel

^ permalink raw reply

* Re: [PATCH 1/7] Makefile, LLVM: add -no-integrated-as to KBUILD_[AC]FLAGS
From: Matthias Kaehlcke @ 2017-04-03 22:49 UTC (permalink / raw)
  To: Michael Davidson
  Cc: Michal Marek, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Herbert Xu, David S. Miller, Shaohua Li, Alexander Potapenko,
	Dmitry Vyukov, x86, linux-kbuild, linux-kernel, linux-crypto,
	linux-raid
In-Reply-To: <20170317001520.85223-2-md@google.com>

El Thu, Mar 16, 2017 at 05:15:14PM -0700 Michael Davidson ha dit:

> Add -no-integrated-as to KBUILD_AFLAGS and KBUILD_CFLAGS
> for clang.
> 
> Signed-off-by: Michael Davidson <md@google.com>
> ---
>  Makefile | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Makefile b/Makefile
> index b841fb36beb2..b21fd0ca2946 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -704,6 +704,8 @@ KBUILD_CFLAGS += $(call cc-disable-warning, tautological-compare)
>  # See modpost pattern 2
>  KBUILD_CFLAGS += $(call cc-option, -mno-global-merge,)
>  KBUILD_CFLAGS += $(call cc-option, -fcatch-undefined-behavior)
> +KBUILD_CFLAGS += $(call cc-option, -no-integrated-as)
> +KBUILD_AFLAGS += $(call cc-option, -no-integrated-as)
>  else
>  
>  # These warnings generated too much noise in a regular build.

Ping, any feedback on this patch?

Cheers

Matthias

^ permalink raw reply

* Re: [PATCH 3/7] x86, LLVM: suppress clang warnings about unaligned accesses
From: Matthias Kaehlcke @ 2017-04-03 23:01 UTC (permalink / raw)
  To: hpa
  Cc: Michael Davidson, Michal Marek, Thomas Gleixner, Ingo Molnar,
	Herbert Xu, David S. Miller, Shaohua Li, Alexander Potapenko,
	Dmitry Vyukov, x86, linux-kbuild, linux-kernel, linux-crypto,
	linux-raid
In-Reply-To: <FB827022-C732-432A-8F31-734C88EBF7A4@zytor.com>

El Fri, Mar 17, 2017 at 04:50:19PM -0700 hpa@zytor.com ha dit:

> On March 16, 2017 5:15:16 PM PDT, Michael Davidson <md@google.com> wrote:
> >Suppress clang warnings about potential unaliged accesses
> >to members in packed structs. This gets rid of almost 10,000
> >warnings about accesses to the ring 0 stack pointer in the TSS.
> >
> >Signed-off-by: Michael Davidson <md@google.com>
> >---
> > arch/x86/Makefile | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> >diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> >index 894a8d18bf97..7f21703c475d 100644
> >--- a/arch/x86/Makefile
> >+++ b/arch/x86/Makefile
> >@@ -128,6 +128,11 @@ endif
> >         KBUILD_CFLAGS += $(call cc-option,-maccumulate-outgoing-args)
> > endif
> > 
> >+ifeq ($(cc-name),clang)
> >+# Suppress clang warnings about potential unaligned accesses.
> >+KBUILD_CFLAGS += $(call cc-disable-warning, address-of-packed-member)
> >+endif
> >+
> > ifdef CONFIG_X86_X32
> > 	x32_ld_ok := $(call try-run,\
> > 			/bin/echo -e '1: .quad 1b' | \
> 
> Why conditional on clang?

My understanding is that this warning is clang specific, it is not
listed on https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html

Cheers

Matthias

^ permalink raw reply

* Re: [PATCH v1] mdadm/grow: reshape would be stuck from raid1 to raid5
From: Zhilong @ 2017-04-04  5:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jes.Sorensen, linux-raid
In-Reply-To: <87d1cukszu.fsf@notabene.neil.brown.name>



Send from iPhone

> 在 2017年4月3日，12:36，NeilBrown <neilb@suse.com> 写道：
> 
>> On Thu, Mar 30 2017, Zhilong Liu wrote:
>> 
>> systemctl doesn't interpret mdadm-grow-continue@.service
>> correctly due to the wrong argument provided in [service],
>> it should be corrected %I as %i. Otherwise, if the service
>> cannot start by systemctl and the reshap progress would be
>> stuck all time when grows array from raid1 to raid5.
>> 
>> reproduce steps:
>> ./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
>> ./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2
>> 
>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>> ---
>> systemd/mdadm-grow-continue@.service | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/systemd/mdadm-grow-continue@.service b/systemd/mdadm-grow-continue@.service
>> index 5c667d2..882bc0b 100644
>> --- a/systemd/mdadm-grow-continue@.service
>> +++ b/systemd/mdadm-grow-continue@.service
>> @@ -10,7 +10,7 @@ Description=Manage MD Reshape on /dev/%I
>> DefaultDependencies=no
>> 
>> [Service]
>> -ExecStart=BINDIR/mdadm --grow --continue /dev/%I
>> +ExecStart=BINDIR/mdadm --grow --continue /dev/%i
> 
> Do you know why this makes a difference?  I don't think it should.
> man systemd.unit says that "%i" is the "Instance name" while "%I" is the
> "Unescaped instance name".
> 
> The Instance name here is something like "md0" so there is nothing to
> escape.
> 
> I would rather not change it unless we know exactly why it is broken,
> and I don't find your explanation to be convincing.
> 

Exactly, you're correct, in this case, %i and %I are the same. The root cause is the ExecStart part, all the path name should be verified by systemd-escape，such as:
/sbin/mdadm should be corrected as sbin-mdadm, and /dev/%I should be -dev-%I. Thus I'm sorry for this patch, I do agree with you not to change it. And say sorry to Jes.

Thanks 
-zhilong 

> NeilBrown
> 
> 
>> StandardInput=null
>> StandardOutput=null
>> StandardError=null
>> -- 
>> 2.10.2
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* mdadm:compiled warning in mdadm.c:1974:29 treats as errors
From: Zhilong @ 2017-04-04  5:31 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-raid

Hi,jes;
  I have found that there is mdfd returns uninitialized warning in mdadm.c. I cannot paste the logs due to I'm on holiday. Please try to issue make everything, you should see it.

Thanks,
-Zhilong 

Send from iPhone

^ permalink raw reply

* Re: [PATCH v1] mdadm/grow: reshape would be stuck from raid1 to raid5
From: Zhilong @ 2017-04-04  5:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jes.Sorensen, linux-raid
In-Reply-To: <5B908F2B-D84A-4C4E-A0C6-863B6E793D75@suse.com>



Send from iPhone

> 在 2017年4月4日，13:07，Zhilong <zlliu@suse.com> 写道：
> 
> 
> 
> Send from iPhone
> 
>>> 在 2017年4月3日，12:36，NeilBrown <neilb@suse.com> 写道：
>>> 
>>> On Thu, Mar 30 2017, Zhilong Liu wrote:
>>> 
>>> systemctl doesn't interpret mdadm-grow-continue@.service
>>> correctly due to the wrong argument provided in [service],
>>> it should be corrected %I as %i. Otherwise, if the service
>>> cannot start by systemctl and the reshap progress would be
>>> stuck all time when grows array from raid1 to raid5.
>>> 
>>> reproduce steps:
>>> ./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
>>> ./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2
>>> 
>>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>>> ---
>>> systemd/mdadm-grow-continue@.service | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/systemd/mdadm-grow-continue@.service b/systemd/mdadm-grow-continue@.service
>>> index 5c667d2..882bc0b 100644
>>> --- a/systemd/mdadm-grow-continue@.service
>>> +++ b/systemd/mdadm-grow-continue@.service
>>> @@ -10,7 +10,7 @@ Description=Manage MD Reshape on /dev/%I
>>> DefaultDependencies=no
>>> 
>>> [Service]
>>> -ExecStart=BINDIR/mdadm --grow --continue /dev/%I
>>> +ExecStart=BINDIR/mdadm --grow --continue /dev/%i
>> 
>> Do you know why this makes a difference?  I don't think it should.
>> man systemd.unit says that "%i" is the "Instance name" while "%I" is the
>> "Unescaped instance name".
>> 
>> The Instance name here is something like "md0" so there is nothing to
>> escape.
>> 
>> I would rather not change it unless we know exactly why it is broken,
>> and I don't find your explanation to be convincing.
>> 
> 
> Exactly, you're correct, in this case, %i and %I are the same. The root cause is the ExecStart part, all the path name should be verified by systemd-escape，such as:
> /sbin/mdadm should be corrected as sbin-mdadm, and /dev/%I should be -dev-%I. Thus I'm sorry for this patch, I do agree with you not to change it. And say sorry to Jes.
> 

How about modifying this patch as:

ExecStart=-sbin-mdadm --grow --continue -dev-%I

Thanks,
Zhilong 

> Thanks 
> -zhilong 
> 
>> NeilBrown
>> 
>> 
>>> StandardInput=null
>>> StandardOutput=null
>>> StandardError=null
>>> -- 
>>> 2.10.2
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply

* Re: [PATCH v1] mdadm/grow: reshape would be stuck from raid1 to raid5
From: NeilBrown @ 2017-04-04  6:30 UTC (permalink / raw)
  To: Zhilong; +Cc: Jes.Sorensen, linux-raid
In-Reply-To: <99E3B202-848D-49F2-BC8A-966A631DE508@suse.com>

[-- Attachment #1: Type: text/plain, Size: 2882 bytes --]

On Tue, Apr 04 2017, Zhilong wrote:

> Send from iPhone
>
>> 在 2017年4月4日，13:07，Zhilong <zlliu@suse.com> 写道：
>> 
>> 
>> 
>> Send from iPhone
>> 
>>>> 在 2017年4月3日，12:36，NeilBrown <neilb@suse.com> 写道：
>>>> 
>>>> On Thu, Mar 30 2017, Zhilong Liu wrote:
>>>> 
>>>> systemctl doesn't interpret mdadm-grow-continue@.service
>>>> correctly due to the wrong argument provided in [service],
>>>> it should be corrected %I as %i. Otherwise, if the service
>>>> cannot start by systemctl and the reshap progress would be
>>>> stuck all time when grows array from raid1 to raid5.
>>>> 
>>>> reproduce steps:
>>>> ./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
>>>> ./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2
>>>> 
>>>> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>>>> ---
>>>> systemd/mdadm-grow-continue@.service | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>> 
>>>> diff --git a/systemd/mdadm-grow-continue@.service b/systemd/mdadm-grow-continue@.service
>>>> index 5c667d2..882bc0b 100644
>>>> --- a/systemd/mdadm-grow-continue@.service
>>>> +++ b/systemd/mdadm-grow-continue@.service
>>>> @@ -10,7 +10,7 @@ Description=Manage MD Reshape on /dev/%I
>>>> DefaultDependencies=no
>>>> 
>>>> [Service]
>>>> -ExecStart=BINDIR/mdadm --grow --continue /dev/%I
>>>> +ExecStart=BINDIR/mdadm --grow --continue /dev/%i
>>> 
>>> Do you know why this makes a difference?  I don't think it should.
>>> man systemd.unit says that "%i" is the "Instance name" while "%I" is the
>>> "Unescaped instance name".
>>> 
>>> The Instance name here is something like "md0" so there is nothing to
>>> escape.
>>> 
>>> I would rather not change it unless we know exactly why it is broken,
>>> and I don't find your explanation to be convincing.
>>> 
>> 
>> Exactly, you're correct, in this case, %i and %I are the same. The root cause is the ExecStart part, all the path name should be verified by systemd-escape，such as:
>> /sbin/mdadm should be corrected as sbin-mdadm, and /dev/%I should be -dev-%I. Thus I'm sorry for this patch, I do agree with you not to change it. And say sorry to Jes.
>> 
>
> How about modifying this patch as:
>
> ExecStart=-sbin-mdadm --grow --continue -dev-%I
>

Why do you think anything needs changing here?

I have a tumbleweed install with the standard
mdadm-grow-continue@.server
file. i.e.

  ExecStart=/sbin/mdadm --grow --continue /dev/%I 

I run
  strace -f -s 1000 -p 1 -o /tmp/strace

in one window, then

  systemctl start mdadm-grow-continue@md0.service

in another.

Then

 grep execve /tmp/strace

shows:

18680 execve("/sbin/mdadm", ["/sbin/mdadm", "--grow", "--continue", "/dev/md0"], [/* 3 vars */] <unfinished ...>

which shows that mdadm is being correctly.

There is nothing to fix here that I can see.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04  8:13 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <877f31kwti.fsf@notabene.neil.brown.name>

Hi, Neil

On 04/03/2017 11:25 PM, NeilBrown wrote:
> On Mon, Apr 03 2017, Michael Wang wrote:
> 
>> blk_attempt_plug_merge() try to merge bio into request and chain them
>> by 'bi_next', while after the bio is done inside request, we forgot to
>> reset the 'bi_next'.
>>
>> This lead into BUG while removing all the underlying devices from md-raid1,
>> the bio once go through:
>>
>>   md_do_sync()
>>     sync_request()
>>       generic_make_request()
> 
> This is a read request from the "first" device.
> 
>>         blk_queue_bio()
>>           blk_attempt_plug_merge()
>>             CHAINED HERE
>>
>> will keep chained and reused by:
>>
>>   raid1d()
>>     sync_request_write()
>>       generic_make_request()
> 
> This is a write request to some other device, isn't it?
> 
> If sync_request_write() is using a bio that has already been used, it
> should call bio_reset() and fill in the details again.
> However I don't see how that would happen.
> Can you give specific details on the situation that triggers the bug?

We have storage side mapping lv through scst to server, on server side
we assemble them into multipath device, and then assemble these dm into
two raid1.

The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
side we unmap all the lv (could during mkfs or fio), then on server side
we hit the BUG (reproducible).

The path of bio was confirmed by add tracing, it is reused in sync_request_write()
with 'bi_next' once chained inside blk_attempt_plug_merge().

We also tried to reset the bi_next inside sync_request_write() before
generic_make_request() which also works.

The testing was done with 4.4, but we found upstream also left bi_next
chained after done in request, thus we post this RFC.

Regarding raid1, we haven't found the place on path where the bio was
reset... where does it supposed to be?

BTW the fix_sync_read_error() also invoked and succeed before trigger
the BUG.

Regards,
Michael Wang

> 
> Thanks,
> NeilBrown
> 
> 
>>         BUG_ON(bio->bi_next)
>>
>> After reset the 'bi_next' this can no longer happen.
>>
>> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
>> ---
>>  block/blk-core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 43b7d06..91223b2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>>                 struct bio *bio = req->bio;
>>                 unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -               if (bio_bytes == bio->bi_iter.bi_size)
>> +               if (bio_bytes == bio->bi_iter.bi_size) {
>>                         req->bio = bio->bi_next;
>> +                       bio->bi_next = NULL;
>> +               }
>>
>>                 req_bio_endio(req, bio, bio_bytes, error);
>>
>> -- 
>> 2.5.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox