Linux RAID subsystem development


* [PATCH] mdadm/grow: reshape would be stuck from raid1 to raid5
From: Zhilong Liu @ 2017-03-20  5:20 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

The reshape gets stuck at the very beginning when growing an
array from raid1 to raid5. Correct the name of
mdadm-grow-continue@.service in continue_via_systemd.

Steps to reproduce:
./mdadm -CR /dev/md0 -l1 -b internal -n2 /dev/loop[0-1]
./mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop2

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Grow.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/Grow.c b/Grow.c
index 455c5f9..10c02a1 100755
--- a/Grow.c
+++ b/Grow.c
@@ -2808,13 +2808,11 @@ static int continue_via_systemd(char *devnm)
 		 */
 		close(2);
 		open("/dev/null", O_WRONLY);
-		snprintf(pathbuf, sizeof(pathbuf), "mdadm-grow-continue@%s.service",
-			 devnm);
+		snprintf(pathbuf, sizeof(pathbuf), "mdadm-grow-continue@.service");
 		status = execl("/usr/bin/systemctl", "systemctl",
 			       "start",
 			       pathbuf, NULL);
-		status = execl("/bin/systemctl", "systemctl", "start",
-			       pathbuf, NULL);
+		pr_err("/usr/bin/systemctl %s got failure\n", pathbuf);
 		exit(1);
 	case -1: /* Just do it ourselves. */
 		break;
-- 
2.6.6



* [PATCH] mdadm:reminded external bitmap only supports ext[2-4] filesystem
From: Zhilong Liu @ 2017-03-20  5:20 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

mdadm: ensure that the external bitmap file is
stored on an ext[2-4] file system, because the bmap()
call used by linux/drivers/md/bitmap.c is only
implemented in inode.c of ext[2-4]. It is better to
make users aware of this and print a message.

steps:
zlliu:/home/liu/git-tree/mdadm # df -T /mnt/2
Filesystem     Type  1K-blocks     Used Available Use% Mounted on
/dev/sda5      btrfs 103078176 11110620  90902628  11% /

./mdadm -CR /dev/md0 -l1 -b /mnt/2 -n2 /dev/loop[0-1]

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Create.c |  5 -----
 mdadm.c  | 15 +++++++++++++++
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/Create.c b/Create.c
index 50ec85e..beec29f 100644
--- a/Create.c
+++ b/Create.c
@@ -827,11 +827,6 @@ int Create(struct supertype *st, char *mddev,
 			goto abort_locked;
 		}
 		bitmap_fd = open(s->bitmap_file, O_RDWR);
-		if (bitmap_fd < 0) {
-			pr_err("weird: %s cannot be openned\n",
-				s->bitmap_file);
-			goto abort_locked;
-		}
 		if (ioctl(mdfd, SET_BITMAP_FILE, bitmap_fd) < 0) {
 			pr_err("Cannot set bitmap file for %s: %s\n",
 				mddev, strerror(errno));
diff --git a/mdadm.c b/mdadm.c
index fcb33d1..566051f 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1149,6 +1149,21 @@ int main(int argc, char *argv[])
 			    strcmp(optarg, "none") == 0 ||
 			    strchr(optarg, '/') != NULL) {
 				s.bitmap_file = optarg;
+				if (strchr(s.bitmap_file, '/') != NULL) {
+					bitmap_fd = open(s.bitmap_file, O_RDWR);
+					if (bitmap_fd < 0) {
+						pr_err("weird: %s cannot be openned\n", s.bitmap_file);
+						exit(2);
+					}
+					close(bitmap_fd);
+					struct statfs ext_bitmap;
+					statfs(s.bitmap_file, &ext_bitmap);
+					if (ext_bitmap.f_type != 0xEF53){
+						pr_err("external bitmap only supports ext[2-4] filesystem, %s.\n",
+							s.bitmap_file);
+						exit(2);
+					}
+				}
 				continue;
 			}
 			if (strcmp(optarg, "clustered") == 0) {
-- 
2.6.6
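
The file system check that this patch adds to mdadm.c can be sketched in
isolation. This is only an illustration: the helper name path_is_on_ext
and the macro SKETCH_EXT_MAGIC are made up here and are not part of mdadm.
Unlike the patch, the sketch also checks the return value of statfs()
before looking at f_type.

```c
#include <sys/vfs.h>	/* statfs(2), Linux-specific */

/* ext2, ext3 and ext4 share this superblock magic (the patch's 0xEF53). */
#define SKETCH_EXT_MAGIC 0xEF53

/*
 * Hypothetical helper, not part of mdadm.
 * Returns 1 if path lives on an ext[2-4] file system, 0 if it lives on
 * some other file system, and -1 if statfs() itself failed (errno is set).
 */
int path_is_on_ext(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return sfs.f_type == SKETCH_EXT_MAGIC ? 1 : 0;
}
```

A caller would treat -1 as "cannot tell" and report strerror(errno),
and only print the ext[2-4] message when the result is 0.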



* [PATCH] mdadm:it doesn't make sense to set --bitmap twice
From: Zhilong Liu @ 2017-03-20  5:21 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

mdadm.c: it doesn't make sense to set --bitmap twice.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 mdadm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mdadm.c b/mdadm.c
index 566051f..2d8498f 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1145,6 +1145,10 @@ int main(int argc, char *argv[])
 		case O(CREATE,Bitmap): /* here we create the bitmap */
 		case O(GROW,'b'):
 		case O(GROW,Bitmap):
+			if (s.bitmap_file) {
+				pr_err("bitmap cannot be set twice. Second value: %s.\n", optarg);
+				exit(2);
+			}
 			if (strcmp(optarg, "internal") == 0 ||
 			    strcmp(optarg, "none") == 0 ||
 			    strchr(optarg, '/') != NULL) {
-- 
2.6.6



* [PATCH] mdadm/mdmon:deleted the abort_reshape never invoked
From: Zhilong Liu @ 2017-03-20  5:21 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

mdmon.c: abort_reshape() is already implemented in Grow.c,
so this empty function does not make much sense here.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 mdmon.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mdmon.c b/mdmon.c
index e4b73d9..95e9bba 100644
--- a/mdmon.c
+++ b/mdmon.c
@@ -580,11 +580,6 @@ int restore_stripes(int *dest, unsigned long long *offsets,
 	return 1;
 }
 
-void abort_reshape(struct mdinfo *sra)
-{
-	return;
-}
-
 int save_stripes(int *source, unsigned long long *offsets,
 		 int raid_disks, int chunk_size, int level, int layout,
 		 int nwrites, int *dest,
-- 
2.6.6



* [PATCH] mdadm/Monitor:triggers core dump when stat2devnm return NULL
From: Zhilong Liu @ 2017-03-20  5:21 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

Monitor: ensure that the device is a block device
when the --wait parameter is used; a regular file ('f')
or a directory ('d') would otherwise trigger a core dump.
For example: ./mdadm --wait /dev/md/

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Monitor.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Monitor.c b/Monitor.c
index 802a9d9..f8850d3 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1002,7 +1002,12 @@ int Wait(char *dev)
 			strerror(errno));
 		return 2;
 	}
-	strcpy(devnm, stat2devnm(&stb));
+	char *tmp = stat2devnm(&stb);
+	if (!tmp) {
+		pr_err("%s is not a block device.\n", dev);
+		return 2;
+	}
+	strcpy(devnm, tmp);
 
 	while(1) {
 		struct mdstat_ent *ms = mdstat_read(1, 0);
-- 
2.6.6
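
The pattern behind this fix, checking a possibly-NULL name before copying
it into a fixed-size buffer, can be sketched on its own. The helper
copy_devnm below is hypothetical and not part of mdadm; it additionally
refuses names that would not fit, which the patch itself does not do.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of mdadm.
 * Copies a device name that may be NULL (as stat2devnm() can return for
 * something that is not a block device) into dst.
 * Returns 0 on success, -1 if src is NULL or does not fit in dst.
 */
int copy_devnm(char *dst, size_t dstlen, const char *src)
{
	size_t len;

	if (src == NULL || dstlen == 0)
		return -1;
	len = strlen(src);
	if (len >= dstlen)
		return -1;
	memcpy(dst, src, len + 1);	/* include the terminating NUL */
	return 0;
}
```

In Wait() this would correspond to returning 2 with the "is not a block
device" message whenever the helper reports -1.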



* [PATCH] mdadm/Monitor:dev should be a block file when use --waitclean
From: Zhilong Liu @ 2017-03-20  5:21 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

Monitor: for mdadm --wait-clean /dev/mdX, the device must
be a block device; otherwise fd2devnm() returns NULL, which
then triggers a core dump.

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 Monitor.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Monitor.c b/Monitor.c
index f8850d3..5a2b5ca 100644
--- a/Monitor.c
+++ b/Monitor.c
@@ -1065,7 +1065,17 @@ int WaitClean(char *dev, int sock, int verbose)
 	struct mdinfo *mdi;
 	int rv = 1;
 	char devnm[32];
+	struct stat stb;
 
+	if (stat(dev, &stb) != 0) {
+		pr_err("Cannot find %s: %s\n", dev,
+			strerror(errno));
+		return 2;
+	}
+	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
+		pr_err("%s is not a block device.\n", dev);
+		return 2;
+	}
 	fd = open(dev, O_RDONLY);
 	if (fd < 0) {
 		if (verbose)
-- 
2.6.6
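
The block-device test added to WaitClean() can be sketched as a small
standalone helper. The name path_is_block_device is made up for this
illustration; S_ISBLK() is the POSIX macro form of the
(S_IFMT & st_mode) == S_IFBLK comparison used in the patch.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of mdadm.
 * Returns 1 if path is a block device, 0 if it is some other kind of
 * file, and -1 if stat() failed (errno is set).
 */
int path_is_block_device(const char *path)
{
	struct stat stb;

	if (stat(path, &stb) != 0)
		return -1;
	return S_ISBLK(stb.st_mode) ? 1 : 0;
}
```

A directory such as /dev/md/ yields 0 here, which is the case that
previously led to the NULL dereference.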



* Re: on assembly and recovery of a hardware RAID
From: NeilBrown @ 2017-03-20  5:34 UTC (permalink / raw)
  To: Alfred Matthews; +Cc: linux-raid
In-Reply-To: <CAAZLhTcoU232uBs2zcK6TFj67Jru5oeNE_F0T2fQR6Un66OPXA@mail.gmail.com>


On Sat, Mar 18 2017, Alfred Matthews wrote:

> I've switched to the backup drives which are clones of the first, now,
> so destructive operations are ok if necessary. Also signatures will
> have changed.
>
> 0. Hm. Evidently the system is JHFS instead of HFS+, per the output
> below. Unsure if there is separate tooling in Debian.
>
> 1. Mount via
>
> mdadm --build /dev/md0 --level=0 -n2 --chunk=512K /dev/sdc2 /dev/sdb2
>
> works just fine. Thanks!
>
> 2. I'm still sticking with the non-destructive, non-mount edits for
> now. So I can report the following:
>
> hpfsck -v /dev/md0 | cat >> hpfsck_output.txt
>
> yields some stuff probably more enlightening than prior.

This is promising until:


> *** Checking Backup Volume Header:
> Unexpected Volume signature '  ' expected 'H+'

Here the backup volume header, which is 2 blocks (blocks are 8K) from
the end of the device, looks wrong.
This probably means the chunk size is wrong.
I would suggest trying different chunksizes, starting at 4K and
doubling, until this message goes away.
That still might not be the correct chunk size, so I would continue up
to several megabytes and find all the chunksizes that seem to work.
Then look at what else hpfsck says on those.
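
The chunk-size sweep described above can be sketched as a pair of small
helpers. This is only an illustration under stated assumptions: the
function names are made up, the upper bound of 4096K is one reading of
"several megabytes", and the device names are taken from the quoted
mdadm --build command. The sketch only generates the command strings; it
does not run anything.

```c
#include <stdio.h>

/*
 * Hypothetical helper. Fills out[] with candidate chunk sizes in K,
 * starting at first_k and doubling up to and including last_k.
 * Returns the number of entries written (at most max).
 */
int chunk_candidates(unsigned first_k, unsigned last_k, unsigned *out, int max)
{
	int n = 0;
	unsigned k;

	/* k != 0 guards against first_k == 0 and against wrap-around */
	for (k = first_k; k != 0 && k <= last_k && n < max; k *= 2)
		out[n++] = k;
	return n;
}

/*
 * Hypothetical helper. Formats one "mdadm --build" command for the given
 * chunk size, using the devices from the quoted message.
 * Returns the snprintf() result.
 */
int format_build_cmd(char *buf, size_t len, unsigned chunk_k)
{
	return snprintf(buf, len,
			"mdadm --build /dev/md0 --level=0 -n2 --chunk=%uK /dev/sdc2 /dev/sdb2",
			chunk_k);
}
```

Between attempts the array would have to be stopped again (for example
with mdadm --stop /dev/md0) before building it with the next chunk size
and rerunning hpfsck.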

BTW, this:
> Invalid total blocks 2BA8CC68, expected 0 Done ***
is not a real problem, just some odd code.

NeilBrown



* Re: [PATCH] mdadm:reminded external bitmap only supports ext[2-4] filesystem
From: Zhilong Liu @ 2017-03-20  5:40 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid
In-Reply-To: <1489987247-22681-1-git-send-email-zlliu@suse.com>


On 03/20/2017 01:20 PM, Zhilong Liu wrote:
> mdadm: ensure that the external bitmap_file is
> stored by ext[2-4] file system, because bmap()
> of linux/driver/md/bitmap.c only implements in
> inode.c of ext[2-4]. it's better to make users
> aware of this scenario and give a prompt.
>
> steps:
> zlliu:/home/liu/git-tree/mdadm # df -T /mnt/2
> Filesystem     Type  1K-blocks     Used Available Use% Mounted on
> /dev/sda5      btrfs 103078176 11110620  90902628  11% /
>
> ./mdadm -CR /dev/md0 -l1 -b /mnt/2 -n2 /dev/loop[0-1]

   The aim was to improve the message shown when using external bitmap
mode, but I actually disagree with this patch for mdadm; this check does
not need to be done in userland.
   As for the errno, a failing bmap() does result in EINVAL, but mdadm
only receives that EINVAL from RUN_ARRAY.
   I think it would be more user-friendly to print a message and return
EINVAL at the same time when bmap() fails, so that the user can easily
tell where the EINVAL comes from.
For example:

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2cca..0bff96b 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -381,6 +381,7 @@ static int read_page(struct file *file, unsigned long index,
                         bh->b_blocknr = bmap(inode, block);
                         if (bh->b_blocknr == 0) {
                                 /* Cannot use this file! */
+                               pr_err("external bitmap only supports to write into a ext[2-4] file.\n");
                                 ret = -EINVAL;
                                 goto out;
                         }

    I look forward to your comments on this scenario.

Thanks,
-Zhilong
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>   Create.c |  5 -----
>   mdadm.c  | 15 +++++++++++++++
>   2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/Create.c b/Create.c
> index 50ec85e..beec29f 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -827,11 +827,6 @@ int Create(struct supertype *st, char *mddev,
>   			goto abort_locked;
>   		}
>   		bitmap_fd = open(s->bitmap_file, O_RDWR);
> -		if (bitmap_fd < 0) {
> -			pr_err("weird: %s cannot be openned\n",
> -				s->bitmap_file);
> -			goto abort_locked;
> -		}
>   		if (ioctl(mdfd, SET_BITMAP_FILE, bitmap_fd) < 0) {
>   			pr_err("Cannot set bitmap file for %s: %s\n",
>   				mddev, strerror(errno));
> diff --git a/mdadm.c b/mdadm.c
> index fcb33d1..566051f 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1149,6 +1149,21 @@ int main(int argc, char *argv[])
>   			    strcmp(optarg, "none") == 0 ||
>   			    strchr(optarg, '/') != NULL) {
>   				s.bitmap_file = optarg;
> +				if (strchr(s.bitmap_file, '/') != NULL) {
> +					bitmap_fd = open(s.bitmap_file, O_RDWR);
> +					if (bitmap_fd < 0) {
> +						pr_err("weird: %s cannot be openned\n", s.bitmap_file);
> +						exit(2);
> +					}
> +					close(bitmap_fd);
> +					struct statfs ext_bitmap;
> +					statfs(s.bitmap_file, &ext_bitmap);
> +					if (ext_bitmap.f_type != 0xEF53){
> +						pr_err("external bitmap only supports ext[2-4] filesystem, %s.\n",
> +							s.bitmap_file);
> +						exit(2);
> +					}
> +				}
>   				continue;
>   			}
>   			if (strcmp(optarg, "clustered") == 0) {



* Re: [PATCH v3 3/6] imsm: PPL support
From: Artur Paszkiewicz @ 2017-03-20  8:07 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid
In-Reply-To: <wrfj4lyrk6je.fsf@gmail.com>

On 03/17/2017 09:11 PM, jes.sorensen@gmail.com wrote:
> Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
>> Enable creating and assembling IMSM raid5 arrays with PPL. Update the
>> IMSM metadata format to include new fields used for PPL.
>>
>> Add structures for PPL metadata. They are used also by super1 and shared
>> with the kernel, so put them in md_p.h.
>>
>> Write the initial empty PPL header when creating an array. When
>> assembling an array with PPL, validate the PPL header and in case it is
>> not correct allow to overwrite it if --force was provided.
>>
>> Write the PPL location and size for a device to the new rdev sysfs
>> attributes 'ppl_sector' and 'ppl_size'. Enable PPL in the kernel by
>> writing to 'consistency_policy' before the array is activated.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Assemble.c    |  49 +++++++++++
>>  Makefile      |   5 +-
>>  md_p.h        |  25 ++++++
>>  mdadm.h       |   6 ++
>>  super-intel.c | 274 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>>  sysfs.c       |  14 +++
>>  6 files changed, 349 insertions(+), 24 deletions(-)
>>
>> diff --git a/Assemble.c b/Assemble.c
>> index 3da09033..8e55b49f 100644
>> --- a/Assemble.c
>> +++ b/Assemble.c
>> @@ -1942,6 +1942,55 @@ int assemble_container_content(struct supertype *st, int mdfd,
>>  	map_update(NULL, fd2devnm(mdfd), content->text_version,
>>  		   content->uuid, chosen_name);
>>  
>> +	if (content->consistency_policy == CONSISTENCY_POLICY_PPL &&
>> +	    st->ss->validate_ppl) {
>> +		content->array.state |= 1;
>> +		err = 0;
>> +
>> +		for (dev = content->devs; dev; dev = dev->next) {
>> +			int dfd;
>> +			char *devpath;
>> +			int ret;
>> +
>> +			ret = st->ss->validate_ppl(st, content, dev);
>> +			if (ret == 0)
>> +				continue;
>> +
>> +			if (ret < 0) {
>> +				err = 1;
>> +				break;
>> +			}
>> +
>> +			if (!c->force) {
>> +				pr_err("%s contains invalid PPL - consider --force or --update-subarray with --update=no-ppl\n",
>> +					chosen_name);
>> +				content->array.state &= ~1;
>> +				avail[dev->disk.raid_disk] = 0;
>> +				break;
>> +			}
>> +
>> +			/* have --force - overwrite the invalid ppl */
>> +			devpath = map_dev(dev->disk.major, dev->disk.minor, 0);
>> +			dfd = dev_open(devpath, O_RDWR);
>> +			if (dfd < 0) {
>> +				pr_err("Failed to open %s\n", devpath);
>> +				err = 1;
>> +				break;
>> +			}
>> +
>> +			err = st->ss->write_init_ppl(st, content, dfd);
>> +			close(dfd);
>> +
>> +			if (err)
>> +				break;
>> +		}
>> +
>> +		if (err) {
>> +			free(avail);
>> +			return err;
>> +		}
>> +	}
>> +
>>  	if (enough(content->array.level, content->array.raid_disks,
>>  		   content->array.layout, content->array.state & 1, avail) == 0) {
>>  		if (c->export && result)
>> diff --git a/Makefile b/Makefile
>> index a6f464c3..49da491f 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -146,7 +146,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o util.o maps.o mdstat.o sysfs.o \
>>  	Kill.o sg_io.o dlink.o ReadMe.o super-intel.o \
>>  	super-mbr.o super-gpt.o \
>>  	super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \
>> -	platform-intel.o probe_roms.o
>> +	platform-intel.o probe_roms.o crc32c.o
>>  
>>  MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS))
>>  
>> @@ -156,7 +156,8 @@ STATICOBJS = pwgr.o
>>  ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c policy.c dlink.c util.c \
>>  	maps.c lib.c xmalloc.c \
>>  	super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
>> -	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c
>> +	platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c mapfile.c \
>> +	crc32c.o
> 
> Hi Artur,
> 
> This looks odd - sure you don't mean crc32c.c ?

Of course this is a mistake, it should be crc32c.c. Sorry for that. But
surprisingly it builds correctly.

Thanks,
Artur


* Re: [PATCH V1] md/raid10: refactor some codes from raid10_write_request
From: Guoqing Jiang @ 2017-03-20  8:28 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, linux-raid, shli, neilb
In-Reply-To: <201703201246.zBo2ZrYb%fengguang.wu@intel.com>



On 03/20/2017 01:05 PM, kbuild test robot wrote:
> Hi Guoqing,
>
> [auto build test ERROR on next-20170310]
> [also build test ERROR on v4.11-rc3]
> [cannot apply to md/for-next v4.9-rc8 v4.9-rc7 v4.9-rc6]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/md-raid10-refactor-some-codes-from-raid10_write_request/20170320-124148
> config: x86_64-randconfig-x004-201712 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>          # save the attached .config to linux build tree
>          make ARCH=x86_64
>
> All errors (new ones prefixed by >>):
>
>     drivers/md/raid10.c: In function 'raid10_write_one_disk':
>>> drivers/md/raid10.c:1203:29: error: 'i' undeclared (first use in this function)
>       int devnum = r10_bio->devs[i].devnum;
>                                  ^
>     drivers/md/raid10.c:1203:29: note: each undeclared identifier is reported only once for each function it appears in
>
> vim +/i +1203 drivers/md/raid10.c
>
>    1197		const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
>    1198		unsigned long flags;
>    1199		struct blk_plug_cb *cb;
>    1200		struct raid10_plug_cb *plug = NULL;
>    1201		struct r10conf *conf = mddev->private;
>    1202		struct md_rdev *rdev;
>> 1203		int devnum = r10_bio->devs[i].devnum;

Oops, I forgot to compile it. :-(

Thanks,
Guoqing


* [PATCH V2] md/raid10: refactor some codes from raid10_write_request
From: Guoqing Jiang @ 2017-03-20  9:46 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
In-Reply-To: <1489743917-10895-1-git-send-email-gqjiang@suse.com>

Previously, we cloned both bio and repl_bio in raid10_write_request,
then added the cloned bio to plug->pending or conf->pending_bio_list
depending on whether a plug was in use, and most of the logic is the
same for bio and repl_bio.

So introduce raid10_write_one_disk for it, and use the replacement
parameter to distinguish the two cases. No functional changes in this
patch.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
Changes from V1:
1. fix compile issues reported by the kbuild test robot
2. also fix some warnings about lines over 80 characters

Changes from RFC:
1. rename handle_clonebio to raid10_write_one_disk
2. s/i/n_copy/ and s/int replacement/bool replacement/

 drivers/md/raid10.c | 175 ++++++++++++++++++++++------------------------------
 1 file changed, 75 insertions(+), 100 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b1b1f982a722..69045b94a9ab 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1188,18 +1188,82 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	return;
 }
 
-static void raid10_write_request(struct mddev *mddev, struct bio *bio,
-				 struct r10bio *r10_bio)
+static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
+				  struct bio *bio, bool replacement,
+				  int n_copy, int max_sectors)
 {
-	struct r10conf *conf = mddev->private;
-	int i;
 	const int op = bio_op(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
 	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
 	unsigned long flags;
-	struct md_rdev *blocked_rdev;
 	struct blk_plug_cb *cb;
 	struct raid10_plug_cb *plug = NULL;
+	struct r10conf *conf = mddev->private;
+	struct md_rdev *rdev;
+	int devnum = r10_bio->devs[n_copy].devnum;
+	struct bio *mbio;
+
+	if (replacement) {
+		rdev = conf->mirrors[devnum].replacement;
+		if (rdev == NULL) {
+			/* Replacement just got moved to main 'rdev' */
+			smp_mb();
+			rdev = conf->mirrors[devnum].rdev;
+		}
+	} else
+		rdev = conf->mirrors[devnum].rdev;
+
+	mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
+	bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector, max_sectors);
+	if (replacement)
+		r10_bio->devs[n_copy].repl_bio = mbio;
+	else
+		r10_bio->devs[n_copy].bio = mbio;
+
+	mbio->bi_iter.bi_sector	= (r10_bio->devs[n_copy].addr +
+				   choose_data_offset(r10_bio, rdev));
+	mbio->bi_bdev = rdev->bdev;
+	mbio->bi_end_io	= raid10_end_write_request;
+	bio_set_op_attrs(mbio, op, do_sync | do_fua);
+	if (!replacement && test_bit(FailFast,
+				     &conf->mirrors[devnum].rdev->flags)
+			 && enough(conf, devnum))
+		mbio->bi_opf |= MD_FAILFAST;
+	mbio->bi_private = r10_bio;
+
+	if (conf->mddev->gendisk)
+		trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
+				      mbio, disk_devt(conf->mddev->gendisk),
+				      r10_bio->sector);
+	/* flush_pending_writes() needs access to the rdev so...*/
+	mbio->bi_bdev = (void *)rdev;
+
+	atomic_inc(&r10_bio->remaining);
+
+	cb = blk_check_plugged(raid10_unplug, mddev, sizeof(*plug));
+	if (cb)
+		plug = container_of(cb, struct raid10_plug_cb, cb);
+	else
+		plug = NULL;
+	spin_lock_irqsave(&conf->device_lock, flags);
+	if (plug) {
+		bio_list_add(&plug->pending, mbio);
+		plug->pending_cnt++;
+	} else {
+		bio_list_add(&conf->pending_bio_list, mbio);
+		conf->pending_count++;
+	}
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+	if (!plug)
+		md_wakeup_thread(mddev->thread);
+}
+
+static void raid10_write_request(struct mddev *mddev, struct bio *bio,
+				 struct r10bio *r10_bio)
+{
+	struct r10conf *conf = mddev->private;
+	int i;
+	struct md_rdev *blocked_rdev;
 	sector_t sectors;
 	int sectors_handled;
 	int max_sectors;
@@ -1402,101 +1466,12 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 	bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0);
 
 	for (i = 0; i < conf->copies; i++) {
-		struct bio *mbio;
-		int d = r10_bio->devs[i].devnum;
-		if (r10_bio->devs[i].bio) {
-			struct md_rdev *rdev = conf->mirrors[d].rdev;
-			mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-			bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
-				 max_sectors);
-			r10_bio->devs[i].bio = mbio;
-
-			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr+
-					   choose_data_offset(r10_bio, rdev));
-			mbio->bi_bdev = rdev->bdev;
-			mbio->bi_end_io	= raid10_end_write_request;
-			bio_set_op_attrs(mbio, op, do_sync | do_fua);
-			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags) &&
-			    enough(conf, d))
-				mbio->bi_opf |= MD_FAILFAST;
-			mbio->bi_private = r10_bio;
-
-			if (conf->mddev->gendisk)
-				trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
-						      mbio, disk_devt(conf->mddev->gendisk),
-						      r10_bio->sector);
-			/* flush_pending_writes() needs access to the rdev so...*/
-			mbio->bi_bdev = (void*)rdev;
-
-			atomic_inc(&r10_bio->remaining);
-
-			cb = blk_check_plugged(raid10_unplug, mddev,
-					       sizeof(*plug));
-			if (cb)
-				plug = container_of(cb, struct raid10_plug_cb,
-						    cb);
-			else
-				plug = NULL;
-			spin_lock_irqsave(&conf->device_lock, flags);
-			if (plug) {
-				bio_list_add(&plug->pending, mbio);
-				plug->pending_cnt++;
-			} else {
-				bio_list_add(&conf->pending_bio_list, mbio);
-				conf->pending_count++;
-			}
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			if (!plug)
-				md_wakeup_thread(mddev->thread);
-		}
-
-		if (r10_bio->devs[i].repl_bio) {
-			struct md_rdev *rdev = conf->mirrors[d].replacement;
-			if (rdev == NULL) {
-				/* Replacement just got moved to main 'rdev' */
-				smp_mb();
-				rdev = conf->mirrors[d].rdev;
-			}
-			mbio = bio_clone_fast(bio, GFP_NOIO, mddev->bio_set);
-			bio_trim(mbio, r10_bio->sector - bio->bi_iter.bi_sector,
-				 max_sectors);
-			r10_bio->devs[i].repl_bio = mbio;
-
-			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr +
-					   choose_data_offset(r10_bio, rdev));
-			mbio->bi_bdev = rdev->bdev;
-			mbio->bi_end_io	= raid10_end_write_request;
-			bio_set_op_attrs(mbio, op, do_sync | do_fua);
-			mbio->bi_private = r10_bio;
-
-			if (conf->mddev->gendisk)
-				trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
-						      mbio, disk_devt(conf->mddev->gendisk),
-						      r10_bio->sector);
-			/* flush_pending_writes() needs access to the rdev so...*/
-			mbio->bi_bdev = (void*)rdev;
-
-			atomic_inc(&r10_bio->remaining);
-
-			cb = blk_check_plugged(raid10_unplug, mddev,
-					       sizeof(*plug));
-			if (cb)
-				plug = container_of(cb, struct raid10_plug_cb,
-						    cb);
-			else
-				plug = NULL;
-			spin_lock_irqsave(&conf->device_lock, flags);
-			if (plug) {
-				bio_list_add(&plug->pending, mbio);
-				plug->pending_cnt++;
-			} else {
-				bio_list_add(&conf->pending_bio_list, mbio);
-				conf->pending_count++;
-			}
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			if (!plug)
-				md_wakeup_thread(mddev->thread);
-		}
+		if (r10_bio->devs[i].bio)
+			raid10_write_one_disk(mddev, r10_bio, bio, false,
+					      i, max_sectors);
+		if (r10_bio->devs[i].repl_bio)
+			raid10_write_one_disk(mddev, r10_bio, bio, true,
+					      i, max_sectors);
 	}
 
 	/* Don't remove the bias on 'remaining' (one_write_done) until
-- 
2.6.2



* [PATCHv2 0/2] mdadm: setting device role of raid1 disk with failfast
From: Gioh Kim @ 2017-03-20  9:51 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel, Gioh Kim

Hi,

I've found a case where the failfast option of mdadm wrongly sets a disk faulty.
The following is my test case.

mdadm --create /dev/md100 -l 1 --failfast -e 1.2 -n 2 /dev/vdb /dev/vdc
mdadm /dev/md100 -a --failfast /dev/vdd

With the failfast option, the vdd disk was wrongly marked faulty.
Without it, the disk was a spare.

This series fixes a corner case when setting the device role and
prints the device role if it is faulty.
This series is based on "mdadm - v4.0-8-g72b616a - 2017-03-07".

v2: fix a typo in v1

Gioh Kim (1):
  super1: ignore failfast flag for setting device role

Jack Wang (1):
  super1: check and output faulty dev role

 super1.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

-- 
2.5.0


* [PATCHv2 1/2] super1: ignore failfast flag for setting device role
From: Gioh Kim @ 2017-03-20  9:51 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel, Gioh Kim, Jack Wang
In-Reply-To: <1490003517-4216-1-git-send-email-gi-oh.kim@profitbricks.com>

There is a corner case when setting the device role
if the new device has the failfast flag.
The failfast flag should be ignored.

Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 super1.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/super1.c b/super1.c
index 882cd61..f3520ac 100644
--- a/super1.c
+++ b/super1.c
@@ -1491,6 +1491,7 @@ static int add_to_super1(struct supertype *st, mdu_disk_info_t *dk,
 	struct devinfo *di, **dip;
 	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb) + MAX_SB_SIZE);
 	int rv, lockid;
+	int dk_state;
 
 	if (bms->version == BITMAP_MAJOR_CLUSTERED && dlm_funs_ready()) {
 		rv = cluster_get_dlmlock(&lockid);
@@ -1501,11 +1502,12 @@ static int add_to_super1(struct supertype *st, mdu_disk_info_t *dk,
 		}
 	}
 
-	if ((dk->state & 6) == 6) /* active, sync */
+	dk_state = dk->state & ~(1<<MD_DISK_FAILFAST);
+	if ((dk_state & 6) == 6) /* active, sync */
 		*rp = __cpu_to_le16(dk->raid_disk);
-	else if (dk->state & (1<<MD_DISK_JOURNAL))
+	else if (dk_state & (1<<MD_DISK_JOURNAL))
                 *rp = MD_DISK_ROLE_JOURNAL;
-	else if ((dk->state & ~2) == 0) /* active or idle -> spare */
+	else if ((dk_state & ~2) == 0) /* active or idle -> spare */
 		*rp = MD_DISK_ROLE_SPARE;
 	else
 		*rp = MD_DISK_ROLE_FAULTY;
-- 
2.5.0
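
The effect of masking the failfast bit can be sketched outside of mdadm.
Everything named SK_* below is an assumption made for this illustration:
the bit numbers and role values are what I recall from md_p.h and should
be verified there, and the function role_for_state is hypothetical. The
little-endian conversion done by add_to_super1() is left out.

```c
/* Bit numbers in the disk state word (assumed, see md_p.h). */
#define SK_DISK_ACTIVE		1
#define SK_DISK_SYNC		2
#define SK_DISK_FAILFAST	10
#define SK_DISK_JOURNAL		18

/* Special role values (assumed, see md_p.h). */
#define SK_ROLE_SPARE		0xffff
#define SK_ROLE_FAULTY		0xfffe
#define SK_ROLE_JOURNAL		0xfffd

/*
 * Hypothetical helper mirroring the role selection in the patch.
 * The failfast bit is cleared first so that it cannot push an otherwise
 * idle device into the final "faulty" branch.
 */
unsigned role_for_state(int state, int raid_disk)
{
	const int active_sync = (1 << SK_DISK_ACTIVE) | (1 << SK_DISK_SYNC);
	int st = state & ~(1 << SK_DISK_FAILFAST);

	if ((st & active_sync) == active_sync)	/* active, sync */
		return (unsigned)raid_disk;
	if (st & (1 << SK_DISK_JOURNAL))
		return SK_ROLE_JOURNAL;
	if ((st & ~(1 << SK_DISK_ACTIVE)) == 0)	/* active or idle -> spare */
		return SK_ROLE_SPARE;
	return SK_ROLE_FAULTY;
}
```

With only the failfast bit set, the unmasked state would fail the
"== 0" test and fall through to faulty; after masking, the same device
is classified as a spare, which matches the behaviour described in the
cover letter.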


* [PATCHv2 2/2] super1: check and output faulty dev role
From: Gioh Kim @ 2017-03-20  9:51 UTC (permalink / raw)
  To: jes.sorensen; +Cc: neilb, linux-raid, linux-kernel, Jack Wang
In-Reply-To: <1490003517-4216-1-git-send-email-gi-oh.kim@profitbricks.com>

From: Jack Wang <jinpu.wang@profitbricks.com>

Output the real device role in examine_super1; it will help to
find problems.

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
---
 super1.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/super1.c b/super1.c
index f3520ac..c903371 100644
--- a/super1.c
+++ b/super1.c
@@ -501,8 +501,10 @@ static void examine_super1(struct supertype *st, char *homehost)
 #endif
 	printf("   Device Role : ");
 	role = role_from_sb(sb);
-	if (role >= MD_DISK_ROLE_FAULTY)
-		printf("spare\n");
+	if (role == MD_DISK_ROLE_SPARE)
+		printf("Spare\n");
+	else if (role == MD_DISK_ROLE_FAULTY)
+		printf("Faulty\n");
 	else if (role == MD_DISK_ROLE_JOURNAL)
 		printf("Journal\n");
 	else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_REPLACEMENT))
-- 
2.5.0



* [PATCH] mdadm/bitmap:fixed typos in comments of bitmap.h
From: Zhilong Liu @ 2017-03-20 10:46 UTC (permalink / raw)
  To: jes.sorensen; +Cc: linux-raid, Zhilong Liu

bitmap.h: fixed trivial typos in comments

Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
 bitmap.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/bitmap.h b/bitmap.h
index b8fb071..7b1f80f 100644
--- a/bitmap.h
+++ b/bitmap.h
@@ -46,7 +46,7 @@
  *
  * The counter counts pending write requests, plus the on-disk bit.
  * When the counter is '1' and the resync bits are clear, the on-disk
- * bit can be cleared aswell, thus setting the counter to 0.
+ * bit can be cleared as well, thus setting the counter to 0.
  * When we set a bit, or in the counter (to start a write), if the fields is
  * 0, we first set the disk bit and set the counter to 1.
  *
@@ -185,7 +185,7 @@ struct bitmap_page {
 	 */
 	char *map;
 	/*
-	 * in emergencies (when map cannot be alloced), hijack the map
+	 * in emergencies (when map cannot be allocated), hijack the map
 	 * pointer and use it as two counters itself
 	 */
 	unsigned int hijacked;
-- 
2.6.6



* proactive disk replacement
From: Jeff Allison @ 2017-03-20 12:47 UTC (permalink / raw)
  To: linux-raid

Hi all, I’ve had a poke around but am yet to find something definitive.

I have a raid 5 array of 4 disks amounting to approx 5.5tb. These disks are getting a bit long in the tooth, so before I get into problems I’ve bought 4 new disks to replace them.

I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.

My current plan is just to replace the 2tb drives with the new 3tb drives and move on. I’d like to do it online without having to trash the array and start again, so does anyone have a game plan for doing that?

Or is a 9tb raid 5 array the wrong thing to be doing, and should I be doing something else, 6tb raid 10 or something? I’m open to suggestions.

Cheers Jeff


* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-20 13:25 UTC (permalink / raw)
  To: Jeff Allison, linux-raid
In-Reply-To: <3FA2E00F-B107-4F3C-A9D3-A10CA5F81EC0@allygray.2y.net>



Am 20.03.2017 um 13:47 schrieb Jeff Allison:
> Hi all I’ve had a poke around but am yet to find something definitive.
>
> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks are getting a bit long in the tooth so before I get into problems I’ve bought 4 new disks to replace them.
>
> I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.
>
> My current plan is just to replace the 2tb drives with the new 3tb drives and move on, I’d like to do it on line with out having to trash the array and start again, so does anyone have a game plan for doing that.
>
> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing something else 6tb raid 10 or something I’m open to suggestions.

you just manually fail them and replace them the same way as if they 
had died unexpectedly - done that multiple times

on machines without bays i just power off, replace a disk, and then clone 
the mbr and add the partitions, the same way as i do when one dies 
(partitions in case you didn't use the whole drives for the array)

http://bencane.com/2011/07/06/mdadm-manually-fail-a-drive/
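
For reference, the fail-and-swap sequence described above might look roughly like the dry-run sketch below. It only prints the commands and executes nothing. The device names (/dev/md0, /dev/sdb, /dev/sda) and the helper name plan_fail_swap are placeholders that do not come from this thread, and using sfdisk to copy the partition table is an assumption about how "clone the mbr" would be done.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for a fail-and-swap replacement.
# Nothing is executed. All device names are hypothetical placeholders.
plan_fail_swap() {
    md=$1        # array, e.g. /dev/md0
    old=$2       # disk being swapped out, e.g. /dev/sdb
    healthy=$3   # surviving member whose partition table is copied
    part=$4      # partition number used as the array member

    echo "mdadm $md --fail ${old}${part}"
    echo "mdadm $md --remove ${old}${part}"
    echo "# power off, swap the physical disk, boot again"
    echo "sfdisk -d $healthy | sfdisk $old"
    echo "mdadm $md --add ${old}${part}"
}

plan_fail_swap /dev/md0 /dev/sdb /dev/sda 1
```

Note that the array runs without redundancy from the --fail step until the resync onto the new disk has finished.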


^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-03-20 14:31 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Rajendra, Herbert Xu, Oded, Ondrej Mosnacek, Mike Snitzer,
	Linux kernel mailing list, Milan Broz, linux-raid, dm-devel,
	Mark Brown, Arnd Bergmann, linux-crypto, Shaohua Li,
	David S. Miller, Alasdair Kergon, Ofir
In-Reply-To: <CAOtvUMdrQA4okKhiCyVmosQ4Oc2PHO4k4wLrDz-_fCyWB1rWBw@mail.gmail.com>

Hi,

On 8 March 2017 at 13:49, Binoy Jayan <binoy.jayan@linaro.org> wrote:
> Hi Gilad,
>
>> I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
>> on Arm  using CryptoCell hardware accelerator.
>>
>> There was no difference in performance between 512 and 4096 bytes
>> cluster size on the x86_64 (800 MB loop file system)
>>
>> There was an improvement in latency of 3.2% between 512 and 4096 bytes
>> cluster size on the Arm. I expect the performance benefits for this
>> test for Binoy's patch to be the same.
>>
>> In both cases the very naive test was a simple dd with block size of
>> 4096 bytes or the raw block device.
>>
>> I do not know what effect having a bigger cluster size would have on
>> have on other more complex file system operations.
>> Is there any specific benchmark worth testing with?

The multiple instances issue in /proc/crypto is fixed. It was because of
the IV code itself modifying the algorithm name inadvertently in the
global crypto algorithm lookup table when it was splitting up
"plain(cbc(aes))" into "plain" and "cbc(aes)" so as to invoke the child
algorithm.

I ran a few tests with dd, bonnie and FIO under QEMU (x86) using the
automated script [1] that I wrote to make the testing easy.
The tests were done on software implementations of the algorithms,
as real hardware was not available to me. With the proposed solution,
sequential reads and writes show a good improvement (5.7%) in data
rate, while random reads show very little improvement. When tested
with FIO, random writes also show a small improvement (2.2%), but
random reads show a slight deterioration in performance (4%).

When tested on Arm hardware, only the sequential writes with bonnie
show an improvement (5.6%). All other tests show degraded performance
in the absence of crypto hardware.

[1] https://github.com/binoyjayan/utilities/blob/master/utils/dmtest
Dependencies: dd [Full version], bonnie, fio

Thanks,
Binoy

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-03-20 14:38 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Milan Broz, Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto,
	Mark Brown, Arnd Bergmann, Linux kernel mailing list,
	Alasdair Kergon, Mike Snitzer, dm-devel, Shaohua Li, linux-raid,
	Rajendra, Ondrej Mosnacek
In-Reply-To: <CAOtvUMdrQA4okKhiCyVmosQ4Oc2PHO4k4wLrDz-_fCyWB1rWBw@mail.gmail.com>

On 6 March 2017 at 20:08, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
> I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
> on Arm  using CryptoCell hardware accelerator.
>
> There was no difference in performance between 512 and 4096 bytes
> cluster size on the x86_64 (800 MB loop file system)
>
> There was an improvement in latency of 3.2% between 512 and 4096 bytes
> cluster size on the Arm. I expect the performance benefits for this
> test for Binoy's patch to be the same.
>
> In both cases the very naive test was a simple dd with block size of
> 4096 bytes or the raw block device.
>
> I do not know what effect having a bigger cluster size would have on
> have on other more complex file system operations.
> Is there any specific benchmark worth testing with?

The multiple instances issue in /proc/crypto is fixed. It was because of
the IV code itself modifying the algorithm name inadvertently in the
global crypto algorithm lookup table when it was splitting up
"plain(cbc(aes))" into "plain" and "cbc(aes)" so as to invoke the child
algorithm.

I ran a few tests with dd, bonnie and FIO under QEMU (x86) using the
automated script [1] that I wrote to make the testing easy.
The tests were done on software implementations of the algorithms,
as real hardware was not available to me. With the proposed solution,
sequential reads and writes show a good improvement (5.7%) in data
rate, while random reads show very little improvement. When tested
with FIO, random writes also show a small improvement (2.2%), but
random reads show a slight deterioration in performance (4%).

When tested on Arm hardware, only the sequential writes with bonnie
show an improvement (5.6%). All other tests show degraded performance
in the absence of crypto hardware.

[1] https://github.com/binoyjayan/utilities/blob/master/utils/dmtest
Dependencies: dd [Full version], bonnie, fio

Thanks,
Binoy

^ permalink raw reply

* Re: proactive disk replacement
From: Adam Goryachev @ 2017-03-20 14:59 UTC (permalink / raw)
  To: Jeff Allison, linux-raid
In-Reply-To: <3FA2E00F-B107-4F3C-A9D3-A10CA5F81EC0@allygray.2y.net>



On 20/3/17 23:47, Jeff Allison wrote:
> Hi all I’ve had a poke around but am yet to find something definitive.
>
> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks are getting a bit long in the tooth so before I get into problems I’ve bought 4 new disks to replace them.
>
> I have a backup so if it all goes west I’m covered. So I’m looking for suggestions.
>
> My current plan is just to replace the 2tb drives with the new 3tb drives and move on, I’d like to do it on line with out having to trash the array and start again, so does anyone have a game plan for doing that.
Yes, do not fail a disk and then replace it, use the newer replace 
method (it keeps redundancy in the array).
Even better would be to add a disk, and convert to RAID6, then add a 
second disk (using replace), and so on, then remove the last disk, grow 
the array to fill the 3TB, and then reduce the number of disks in the raid.
This way, you end up with RAID6...
> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing something else 6tb raid 10 or something I’m open to suggestions.
I'd feel safer with RAID6, but it depends on your requirements. RAID10 
is also a nice option, but, it depends...

Regards,
Adam


^ permalink raw reply

* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-20 15:04 UTC (permalink / raw)
  To: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <11c21a22-4bbf-7b16-5e64-8932be768c68@websitemanagers.com.au>



On 20.03.2017 at 15:59, Adam Goryachev wrote:
> On 20/3/17 23:47, Jeff Allison wrote:
>> Hi all I’ve had a poke around but am yet to find something definitive.
>>
>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>> disks are getting a bit long in the tooth so before I get into
>> problems I’ve bought 4 new disks to replace them.
>>
>> I have a backup so if it all goes west I’m covered. So I’m looking for
>> suggestions.
>>
>> My current plan is just to replace the 2tb drives with the new 3tb
>> drives and move on, I’d like to do it on line with out having to trash
>> the array and start again, so does anyone have a game plan for doing
>> that.
> Yes, do not fail a disk and then replace it, use the newer replace
> method (it keeps redundancy in the array)

how should it keep redundancy when you have to remove a disk anyway, 
unless you have enough slots to at least temporarily add an additional one?

^ permalink raw reply

* Re: proactive disk replacement
From: Adam Goryachev @ 2017-03-20 15:23 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison, linux-raid
In-Reply-To: <c20d279a-62b1-2c3d-cdaf-39a838034afb@thelounge.net>



On 21/3/17 02:04, Reindl Harald wrote:
>
>
> On 20.03.2017 at 15:59, Adam Goryachev wrote:
>> On 20/3/17 23:47, Jeff Allison wrote:
>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>
>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>> disks are getting a bit long in the tooth so before I get into
>>> problems I’ve bought 4 new disks to replace them.
>>>
>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>> suggestions.
>>>
>>> My current plan is just to replace the 2tb drives with the new 3tb
>>> drives and move on, I’d like to do it on line with out having to trash
>>> the array and start again, so does anyone have a game plan for doing
>>> that.
>> Yes, do not fail a disk and then replace it, use the newer replace
>> method (it keeps redundancy in the array)
>
> how should it keep redundancy when you have to remove a disk anyways 
> except you have enough slots to at least temporary add a additional one?
Yes, assuming you can (at least temporarily) add an additional disk, 
then you will not lose redundancy by using the replace instead of 
fail/add method.
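
As a rough illustration of the replace method, assuming a spare slot is available, the sequence could look like the dry-run sketch below. It prints commands only. The device names and the helper names plan_replace and plan_grow are placeholders; --replace, --with and --grow --size=max are mdadm options, but check the man page for your mdadm version before relying on this.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for an in-place member replacement
# that keeps the array redundant. Device names are hypothetical.
plan_replace() {
    md=$1    # array, e.g. /dev/md0
    old=$2   # member to retire, e.g. /dev/sda1
    new=$3   # new member in the temporary extra slot, e.g. /dev/sde1

    echo "mdadm $md --add $new"
    echo "mdadm $md --replace $old --with $new"
    echo "# wait until /proc/mdstat shows the copy has finished"
    echo "mdadm $md --remove $old"
}

# Once every member sits on a larger disk, the array can be grown.
plan_grow() {
    echo "mdadm --grow $1 --size=max"
}

plan_replace /dev/md0 /dev/sda1 /dev/sde1
plan_grow /dev/md0
```

The old member stays in service while its contents are copied, which is why redundancy is not lost during the copy. The filesystem on top still has to be resized separately after the grow.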

Regards,
Adam

^ permalink raw reply

* Re: stripe_cache_size, some info
From: Wols Lists @ 2017-03-20 15:59 UTC (permalink / raw)
  To: Gandalf Corvotempesta, linux-raid
In-Reply-To: <CAJH6TXhL4=MLanyvQh8UMkBmfkfjPd-frsCOC3PqLnDXCpMetw@mail.gmail.com>

On 19/03/17 18:35, Gandalf Corvotempesta wrote:
> As I would like to replace most of our HW raid controller with mdadm,
> any suggestion on how to improve RAID-6 speed ?

Burst speed, or sustained speed? Big difference ...
> 
> Modern CPU aren't an issue, I don't think that double-parity
> calculation could create any bottleneck on a modern CPU.

Using a journal on an SSD will offload stuff and give you a decent burst
speed, I suspect. You'll need to get benchmarks, but that should mean
you don't notice a slow background write speed.

> The real advantages of a raid controller are mostly 2:
> 
> 1) the writeback cache (1GB or 2GB)
> 2) the ability to automatically replace a disk by hotswapping it.
> 
> Any solution to this ? For the "2", i've tried by configuring the
> POLICY in mdadm.conf but new disk is never reconized and I always have
> to manually add the new disk to the array.

I think you have to manually add the disk to the array (group) as a
spare first.

And I would avoid that entirely if I can - put the new disk in, do a
--replace, and then remove the old one. Doing a hotswap like that will
increase the stress on the array, and increased stress means another
disk is more likely to fail.

Cheers,
Wol

^ permalink raw reply

* Re: stripe_cache_size, some info
From: Gandalf Corvotempesta @ 2017-03-20 16:13 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <58CFFC47.3000306@youngman.org.uk>

2017-03-20 16:59 GMT+01:00 Wols Lists <antlists@youngman.org.uk>:
> Burst speed, or sustained speed? Big difference ...

Both :)

> And I would avoid that entirely if I can - put the new disk in, do a
> --replace, and then remove the old one. Doing a hotswap like that will
> increase the stress on the array, and increased stress means another
> disk is more likely to fail.

On newer servers, I tend to avoid using all slots because of this.
With at least 1 slot available, cool things can be done, like
replacing disks without compromising redundancy and so on.

But on older servers, I don't have enough slots available and the only
way to replace a disk is... to directly replace a disk :)

^ permalink raw reply

* Re: proactive disk replacement
From: Wols Lists @ 2017-03-20 16:19 UTC (permalink / raw)
  To: Adam Goryachev, Reindl Harald, Jeff Allison, linux-raid
In-Reply-To: <3df5e6da-6085-58fb-2811-cb4be843e676@websitemanagers.com.au>

On 20/03/17 15:23, Adam Goryachev wrote:
> 
> 
> On 21/3/17 02:04, Reindl Harald wrote:
>>
>>
>> On 20.03.2017 at 15:59, Adam Goryachev wrote:
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>>> disks are getting a bit long in the tooth so before I get into
>>>> problems I’ve bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb
>>>> drives and move on, I’d like to do it on line with out having to trash
>>>> the array and start again, so does anyone have a game plan for doing
>>>> that.
>>> Yes, do not fail a disk and then replace it, use the newer replace
>>> method (it keeps redundancy in the array)
>>
>> how should it keep redundancy when you have to remove a disk anyways
>> except you have enough slots to at least temporary add a additional one?
> Yes, assuming you can (at least temporarily) add an additional disk,
> then you will not lose redundancy by using the replace instead of
> fail/add method.
> 
Take a look at the raid wiki. Especially this page ...

https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive

Okay, it's my work (unless people have come in since and edited it) but
I make a point of asking "the people who should know" to check my work
if I'm at all unsure. So this will have been looked over for mistakes by
various people on the list who either write the code or provide advice
and support.

And yes, as you can see from that page, I'd say add a new disk then
--replace it into the array. And upgrading the array to raid6 is a good
idea. But for Adam's way I think you need two extra temporary drive
slots. What I think you can do instead is this: on the new drives, make
the underlying partition the full 3TB. You can then replace all four
drives. So long as 2*3TB >= 3*2TB (don't laugh - it might not be!!!) you
should be able to reduce the number of drives to three, then add the
fourth back to give raid6.
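
The capacity check above can be sketched as a small calculation. This is only an illustration using nominal sizes in GB; the helper name usable_capacity is made up for this example. A real check should compare actual byte counts (for instance from blockdev --getsize64 on each partition) and allow for metadata overhead, which is exactly why 2*3TB might fall short of 3*2TB.

```shell
#!/bin/sh
# usable_capacity LEVEL NDISKS SIZE
# Prints the data capacity of a RAID5/RAID6 array in the same unit as SIZE.
# Hypothetical helper for illustration; ignores metadata overhead.
usable_capacity() {
    level=$1; ndisks=$2; size=$3
    case "$level" in
        5) parity=1 ;;
        6) parity=2 ;;
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
    echo $(( (ndisks - parity) * size ))
}

current=$(usable_capacity 5 4 2000)   # existing 4 x 2TB raid5
interim=$(usable_capacity 5 3 3000)   # raid5 reduced to 3 x 3TB
final=$(usable_capacity 6 4 3000)     # raid6 on 4 x 3TB

if [ "$interim" -ge "$current" ] && [ "$final" -ge "$current" ]; then
    echo "data fits: current=$current interim=$interim final=$final"
else
    echo "data does NOT fit: current=$current interim=$interim final=$final"
fi
```

With nominal sizes all three figures come out equal, so there is no margin at all; any shortfall in the real partition sizes would make the reduction to three drives impossible.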

The other thing is, if you've got the space for Adam's method, you could
always temporarily create a 4TB drive by combining 2*2TB in a raid0 -
probably best striped rather than linear.

Cheers,
Wol


^ permalink raw reply
