Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Shrinking an array
From: Wakko Warner @ 2017-04-11  0:30 UTC (permalink / raw)
  To: linux-raid

I have a question about shrinking an array.  My current array is 4x 2tb
disks in raid6 (md0).  The array was created on the 2nd partition of each
disk and spans most of the disk.  I would like to replace the 2tb disks with
750gb disks.  md0 is a luks container with lvm underneath.  I have less than
1tb actually in use.  What would the recommended procedure be for shrinking
this?  I've watched this list, but I don't think I've come across anyone
actually wanting to do this before.
I'm thinking of these steps already:
1) Shrink PV.
2) Shrink luks.  I'm aware that there is not size metadata, but the dm
mapping would need to be shrunk.
3) Shrink md0.  I did this once when I changed a 6 drive raid6 into a 5
drive raid6.  Would I use --array-size= or --size= ?  I understand the
difference is the size of md0 vs the individual members.

So for number 4, if md0 is now small enough, will it accept a member that is
smaller?  If so, I should beable to add the member to the array and issue
--replace.

Thanks.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.

^ permalink raw reply

* Re: Shrinking an array
From: Adam Goryachev @ 2017-04-11  1:05 UTC (permalink / raw)
  To: Wakko Warner, linux-raid
In-Reply-To: <20170411003008.GA18538@animx.eu.org>

On 11/04/17 10:30, Wakko Warner wrote:
> I have a question about shrinking an array.  My current array is 4x 2tb
> disks in raid6 (md0).  The array was created on the 2nd partition of each
> disk and spans most of the disk.  I would like to replace the 2tb disks with
> 750gb disks.  md0 is a luks container with lvm underneath.  I have less than
> 1tb actually in use.  What would the recommended procedure be for shrinking
> this?  I've watched this list, but I don't think I've come across anyone
> actually wanting to do this before.
> I'm thinking of these steps already:
> 1) Shrink PV.
> 2) Shrink luks.  I'm aware that there is not size metadata, but the dm
> mapping would need to be shrunk.
> 3) Shrink md0.  I did this once when I changed a 6 drive raid6 into a 5
> drive raid6.  Would I use --array-size= or --size= ?  I understand the
> difference is the size of md0 vs the individual members.
>
> So for number 4, if md0 is now small enough, will it accept a member that is
> smaller?  If so, I should beable to add the member to the array and issue
> --replace.
>
> Thanks.
>
I think the order is wrong.... or I mis-understood the layering. You 
need to shrink the highest layer first and work down the stack, for LVM 
on luks on RAID it would be something like this:
1) Reduce the filesystem size
2) Reduce the LV size
3) Reduce the PV size
4) Reduce the luks size
5) Reduce the RAID (mdadm) size
6) Replace the physical devices with smaller ones (or reduce the 
partition size/etc as needed)

I generally reduce my each one with a decent margin, and then when I'm 
finished, I increase each one to fill the available space (after the 
physical device size is changed). This avoids issues with accidentally 
trimming too much and then losing data/corrupting data. You should also 
verify that all your data is accessible after each step, most steps are 
reversible if you identify the issue quickly enough (at least with 
simple stacks when changing partition size, LVM and/or luks might 
complicate that).

Regards,
Adam



-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply

* Re: Shrinking an array
From: Wakko Warner @ 2017-04-11  2:00 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid
In-Reply-To: <1f0601cb-f91d-0a1e-fc1d-5dd1c126a467@websitemanagers.com.au>

Adam Goryachev wrote:
> On 11/04/17 10:30, Wakko Warner wrote:
> >I have a question about shrinking an array.  My current array is 4x 2tb
> >disks in raid6 (md0).  The array was created on the 2nd partition of each
> >disk and spans most of the disk.  I would like to replace the 2tb disks with
> >750gb disks.  md0 is a luks container with lvm underneath.  I have less than
> >1tb actually in use.  What would the recommended procedure be for shrinking
> >this?  I've watched this list, but I don't think I've come across anyone
> >actually wanting to do this before.
> >I'm thinking of these steps already:
> >1) Shrink PV.
> >2) Shrink luks.  I'm aware that there is not size metadata, but the dm
> >mapping would need to be shrunk.
> >3) Shrink md0.  I did this once when I changed a 6 drive raid6 into a 5
> >drive raid6.  Would I use --array-size= or --size= ?  I understand the
> >difference is the size of md0 vs the individual members.
> >
> >So for number 4, if md0 is now small enough, will it accept a member that is
> >smaller?  If so, I should beable to add the member to the array and issue
> >--replace.
> >
> >Thanks.
> >
> I think the order is wrong.... or I mis-understood the layering. You
> need to shrink the highest layer first and work down the stack, for
> LVM on luks on RAID it would be something like this:
> 1) Reduce the filesystem size
> 2) Reduce the LV size
> 3) Reduce the PV size
> 4) Reduce the luks size
> 5) Reduce the RAID (mdadm) size
> 6) Replace the physical devices with smaller ones (or reduce the
> partition size/etc as needed)

Thanks, however, 1 and 2 doesn't need to happen.  I deleted several LVs.  My
highest PE in use is 771583 (each extent is 1M) and the last lv could be
moved somewhere else which would leave my highest PE 87982. 

> I generally reduce my each one with a decent margin, and then when
> I'm finished, I increase each one to fill the available space (after
> the physical device size is changed). This avoids issues with
> accidentally trimming too much and then losing data/corrupting data.
> You should also verify that all your data is accessible after each
> step, most steps are reversible if you identify the issue quickly
> enough (at least with simple stacks when changing partition size,
> LVM and/or luks might complicate that).

I'm not that concerned about the LV that is at the highest PE range.  If I
loose it, I loose it.  it's the reduce the raid size I was wondering about. 
Can you give me an example of the command to run for it?

I would be interested in doing this while the system is running (it only has
1180 days uptime =)  However, I don't have the replacement disks yet.  I
wanted ideas before I actually do it.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.

^ permalink raw reply

* Re: Shrinking an array
From: NeilBrown @ 2017-04-11  4:33 UTC (permalink / raw)
  To: Wakko Warner, linux-raid
In-Reply-To: <20170411003008.GA18538@animx.eu.org>

[-- Attachment #1: Type: text/plain, Size: 2102 bytes --]

On Mon, Apr 10 2017, Wakko Warner wrote:

> I have a question about shrinking an array.  My current array is 4x 2tb
> disks in raid6 (md0).  The array was created on the 2nd partition of each
> disk and spans most of the disk.  I would like to replace the 2tb disks with
> 750gb disks.  md0 is a luks container with lvm underneath.  I have less than
> 1tb actually in use.  What would the recommended procedure be for shrinking
> this?  I've watched this list, but I don't think I've come across anyone
> actually wanting to do this before.
> I'm thinking of these steps already:
> 1) Shrink PV.
> 2) Shrink luks.  I'm aware that there is not size metadata, but the dm
> mapping would need to be shrunk.
> 3) Shrink md0.  I did this once when I changed a 6 drive raid6 into a 5
> drive raid6.  Would I use --array-size= or --size= ?  I understand the
> difference is the size of md0 vs the individual members.

You don't need --array-size, as reducing --size is non-destructive.
Reducing the number of devices *is* destructive, so if you were to do
that, you would need to adjust the --array-size first.

So when you have prepared the contents of the array, use
  mdadm /dev/mdXXX --grow --size=750G

or whatever size you think is appropriate.

Then check that all your data is still accessible.  e.g. fsck your
filesystem.
If you are confident that the data is still accessible, you can continue
to --replace devices.
If not, you can
  mdadm /dev/mdXXX --grow --size=max

to restore the previous state, and then try to figure out what went
wrong.

NeilBrown


>
> So for number 4, if md0 is now small enough, will it accept a member that is
> smaller?  If so, I should beable to add the member to the array and issue
> --replace.
>
> Thanks.
>
> -- 
>  Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
>  million bugs.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: Shrinking an array
From: Roman Mamedov @ 2017-04-11  6:58 UTC (permalink / raw)
  To: Wakko Warner; +Cc: linux-raid
In-Reply-To: <20170411003008.GA18538@animx.eu.org>

On Mon, 10 Apr 2017 20:30:08 -0400
Wakko Warner <wakko@animx.eu.org> wrote:

> I have less than 1tb actually in use.

I'd say just create a new array from scratch on new drives and copy data over.
Going to great lengths with all the layering to ensure proper and safe
shrinking on each is just wasted effort to babysit what is basically an empty
array.

-- 
With respect,
Roman

^ permalink raw reply

* Re: [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Artur Paszkiewicz @ 2017-04-11  8:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, songliubraving
In-Reply-To: <20170410190948.v33qgv63r423zfbb@kernel.org>

On 04/10/2017 09:09 PM, Shaohua Li wrote:
>> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
>> +{
>> +	struct kmem_cache *kc = pool_data;
>> +	struct ppl_io_unit *io;
>> +
>> +	io = kmem_cache_alloc(kc, gfp_mask);
>> +	if (!io)
>> +		return NULL;
>> +
>> +	io->header_page = alloc_page(gfp_mask);
>> +	if (!io->header_page) {
>> +		kmem_cache_free(kc, io);
>> +		return NULL;
>> +	}
>> +
>> +	return io;
> 
> Maybe directly use GFP_NOWAIT here, we don't use other gfp

Do you mean ignore the gfp_mask parameter? I don't think we should do
that. It is provided by the mempool implementation. We only use
GFP_NOWAIT for explicit allocations, but GFP_KERNEL is used for
allocating initial items in mempool_create().

>> +}
>> +
>> +static void ppl_io_pool_free(void *element, void *pool_data)
>> +{
>> +	struct kmem_cache *kc = pool_data;
>> +	struct ppl_io_unit *io = element;
>> +
>> +	__free_page(io->header_page);
>> +	kmem_cache_free(kc, io);
>> +}
>> +
>>  static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>  					  struct stripe_head *sh)
>>  {
>> @@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>  	struct ppl_io_unit *io;
>>  	struct ppl_header *pplhdr;
>>  
>> -	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
>>  	if (!io)
>>  		return NULL;
>>  
>> -	memset(io, 0, sizeof(*io));
>>  	io->log = log;
>> +	io->entries_count = 0;
>> +	io->pp_size = 0;
>> +	io->submitted = false;
> 
> I'd suggest moving the memset to ppl_io_pool_alloc. Don't think we need to
> optimize to avoid setting header_page. And doing memset is less error prone,
> for example adding new fields.

OK, I'll change this and resend.

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Artur Paszkiewicz @ 2017-04-11  9:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, songliubraving
In-Reply-To: <87fd9082-46bb-0836-7882-f0f572904549@intel.com>

On 04/11/2017 10:28 AM, Artur Paszkiewicz wrote:
> On 04/10/2017 09:09 PM, Shaohua Li wrote:
>>> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
>>> +{
>>> +	struct kmem_cache *kc = pool_data;
>>> +	struct ppl_io_unit *io;
>>> +
>>> +	io = kmem_cache_alloc(kc, gfp_mask);
>>> +	if (!io)
>>> +		return NULL;
>>> +
>>> +	io->header_page = alloc_page(gfp_mask);
>>> +	if (!io->header_page) {
>>> +		kmem_cache_free(kc, io);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	return io;
>>
>> Maybe directly use GFP_NOWAIT here, we don't use other gfp
> 
> Do you mean ignore the gfp_mask parameter? I don't think we should do
> that. It is provided by the mempool implementation. We only use
> GFP_NOWAIT for explicit allocations, but GFP_KERNEL is used for
> allocating initial items in mempool_create().
> 
>>> +}
>>> +
>>> +static void ppl_io_pool_free(void *element, void *pool_data)
>>> +{
>>> +	struct kmem_cache *kc = pool_data;
>>> +	struct ppl_io_unit *io = element;
>>> +
>>> +	__free_page(io->header_page);
>>> +	kmem_cache_free(kc, io);
>>> +}
>>> +
>>>  static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>>  					  struct stripe_head *sh)
>>>  {
>>> @@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>>  	struct ppl_io_unit *io;
>>>  	struct ppl_header *pplhdr;
>>>  
>>> -	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
>>>  	if (!io)
>>>  		return NULL;
>>>  
>>> -	memset(io, 0, sizeof(*io));
>>>  	io->log = log;
>>> +	io->entries_count = 0;
>>> +	io->pp_size = 0;
>>> +	io->submitted = false;
>>
>> I'd suggest moving the memset to ppl_io_pool_alloc. Don't think we need to
>> optimize to avoid setting header_page. And doing memset is less error prone,
>> for example adding new fields.
> 
> OK, I'll change this and resend.

Well, on second thought, actually I can't move the memset to
ppl_io_pool_alloc. Mempool can use the preallocated elements, which are
then returned back to the pool after calling mempool_free() and can be
reused later. So a ppl_io_unit must be initialized after retrieving it
from mempool_alloc(). I can leave the memset in ppl_new_iounit() but
io->header_page pointer will have to be temporarily copied before
zeroing the iounit and then set back. Is this ok?

Thanks,
Artur

^ permalink raw reply

* Re: 4.10 + 765d704db: no improvemtn in write rates with md/raid5 group_thread_cnt > 0
From: Nix @ 2017-04-11 10:01 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Shaohua Li
In-Reply-To: <20170410201057.qxypmzaw4gtmkwvd@kernel.org>

On 10 Apr 2017, Shaohua Li stated:

> On Wed, Apr 05, 2017 at 03:13:48PM +0100, Nix wrote:
>> So you'd expect write rates on a RAID-5 array to be higher than write rates on a
>> single spinning-rust disk, right? Because, even with Shaohua's commit
>> 765d704db1f583630d52 applied atop 4.10, I see little sign of it. Does this
>> commit depend upon something else to stop death by seeking with
>> group_thread_cnt > 0? It didn't look like it to me...
>> 
>> The results Shaohua showed in the original commit were very impressive, but for
>> the life of me I can't figure out how to get anything like them.
>
> That only works well with large iodepth. For single write, we are still far
> from the BW in theory. I actually wrote in the commit log:
>
> "We are pretty close to the maximum bandwidth in the large iodepth
> iodepth case. The performance gap of small iodepth sequential write
> between software raid and theory value is still very big though, because
> we don't have an efficient pipeline."

Ah right, I missed the significance of that. So this helps only if you
have multiple simultaneous multithreaded/async I/Os to the same file at
the same time? Damn, no help in any of my common use cases yet :( :( :(
except maybe massively-parallel compiles, but they are never write-bound
except when linking, and *that* is serial.

I guess I have to wait and hope for a better pipeline :)

^ permalink raw reply

* [PATCH] md: return 0 instead of error in rdev_attr_show()
From: Michael Wang @ 2017-04-11 10:14 UTC (permalink / raw)
  To: linux-raid, linux-kernel@vger.kernel.org; +Cc: Shaohua Li, NeilBrown


sysfs_kf_read() expect the show() callback return the dumped
length, while rdev_attr_show() can return the error which lead
into overflow:

 BUG: unable to handle kernel paging request at ffff88040b084000
 IP: [<ffffffff812fe8f4>] __memmove+0x24/0x1a0
 PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 800000040b084161
 Oops: 0003 [#1] SMP 
 [snip]
 Call Trace:
  [<ffffffff81208c30>] ? sysfs_kf_read+0x80/0xb0
  [<ffffffff81207e0b>] kernfs_fop_read+0xab/0x160
  [<ffffffff81190be8>] __vfs_read+0x28/0xd0
  [<ffffffff811911f6>] vfs_read+0x86/0x130
  [<ffffffff81191fa6>] SyS_read+0x46/0xa0
  [<ffffffff8158bc97>] entry_SYSCALL_64_fastpath+0x12/0x6a

Simply return 0 in case of error solved the problem.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
---
 drivers/md/md.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1db88d7..d46d714 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
 
-	if (!entry->show)
-		return -EIO;
-	if (!rdev->mddev)
-		return -EBUSY;
+	if (!entry->show || !rdev->mddev)
+		return 0;
+
 	return entry->show(rdev, page);
 }
 
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH] md: return 0 instead of error in rdev_attr_show()
From: Michael Wang @ 2017-04-11 10:32 UTC (permalink / raw)
  To: linux-raid, linux-kernel@vger.kernel.org; +Cc: Shaohua Li, NeilBrown
In-Reply-To: <769019bb-84a1-044f-bac1-d44ff1ee9177@profitbricks.com>

We found the upstream fix, sorry for the noise...

Regards,
Michael Wang

On 04/11/2017 12:14 PM, Michael Wang wrote:
> 
> sysfs_kf_read() expect the show() callback return the dumped
> length, while rdev_attr_show() can return the error which lead
> into overflow:
> 
>  BUG: unable to handle kernel paging request at ffff88040b084000
>  IP: [<ffffffff812fe8f4>] __memmove+0x24/0x1a0
>  PGD 1edb067 PUD 1ede067 PMD 406b9a063 PTE 800000040b084161
>  Oops: 0003 [#1] SMP 
>  [snip]
>  Call Trace:
>   [<ffffffff81208c30>] ? sysfs_kf_read+0x80/0xb0
>   [<ffffffff81207e0b>] kernfs_fop_read+0xab/0x160
>   [<ffffffff81190be8>] __vfs_read+0x28/0xd0
>   [<ffffffff811911f6>] vfs_read+0x86/0x130
>   [<ffffffff81191fa6>] SyS_read+0x46/0xa0
>   [<ffffffff8158bc97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> 
> Simply return 0 in case of error solved the problem.
> 
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  drivers/md/md.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1db88d7..d46d714 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3271,10 +3271,9 @@ rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
>  	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
>  	struct md_rdev *rdev = container_of(kobj, struct md_rdev, kobj);
>  
> -	if (!entry->show)
> -		return -EIO;
> -	if (!rdev->mddev)
> -		return -EBUSY;
> +	if (!entry->show || !rdev->mddev)
> +		return 0;
> +
>  	return entry->show(rdev, page);
>  }
>  
> 

^ permalink raw reply

* Re: Can we deprecate ioctl(RAID_VERSION)?
From: Jes Sorensen @ 2017-04-11 14:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Hannes Reinecke, kernel-team
In-Reply-To: <878tn9yymi.fsf@notabene.neil.brown.name>

On 04/09/2017 07:01 PM, NeilBrown wrote:
> On Fri, Apr 07 2017, jes.sorensen@gmail.com wrote:
>
>>
>> Next question since I am wearing my 'what is this old stuff doing' hat.
>> mdassemble? Does anything still use this? The reason is a lot of the
>> newer features are explicitly included, and switching to sysfs is
>> effectively going to kill it, unless it gets a major upgrade.
>>
>
> I was never a big fan, of mdassemble, but it is smaller than mdadm and
> some people apparently have (or had) space-constrained boot
> environments.
> Maybe post to the linux-raid list with a subject "mdassemble is going
> way, do you care?". ??

Makes sense - looking at the code it wasn't clear if it is even usable 
in modern times since it ignores sysfs access.

I'll post another message and get out the big hammer!

Cheers,
Jes

^ permalink raw reply

* mdassemble will disappear - last chance to stop it!
From: Jes Sorensen @ 2017-04-11 14:48 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Hi,

 From conversation in other another thread, it seems obvious that 
mdassemble is a thing of the past that nobody really seems to care about.

To get rid of the cruft and having to maintain code that has little 
chance of working with modern kernels, I am going to start removing all 
trace of it from the mdadm codebase.

If you have any compelling reason for wanting to keep mdassemble around, 
now would be a good time to state it.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH 2/2] mdadm/manpage:clustered array doesn't support --array-size yet
From: Jes Sorensen @ 2017-04-11 14:49 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <1491811478-9234-2-git-send-email-zlliu@suse.com>

On 04/10/2017 04:04 AM, Zhilong Liu wrote:
> update man page for --array-size:
> clustered array isn't allowed to change array_sector by now.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  mdadm.8.in | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/mdadm.8.in b/mdadm.8.in
> index f006cf5..3555516 100644
> --- a/mdadm.8.in
> +++ b/mdadm.8.in
> @@ -541,6 +541,10 @@ A value of
>  restores the apparent size of the array to be whatever the real
>  amount of available space is.
>
> +The
> +.B clustered
> +array isn't supported this parameter yet.

Should this say "Clustered arrays do not support this parameter yet." ?

Thanks,
Jes



^ permalink raw reply

* Re: [PATCH 1/2] mdadm/manpage:update description for readonly in manpage
From: Jes Sorensen @ 2017-04-11 14:51 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <1491811478-9234-1-git-send-email-zlliu@suse.com>

On 04/10/2017 04:04 AM, Zhilong Liu wrote:
> update readonly/readwrite in man page:
> Currently both the readwrite and readonly are worked well,
> thus updates description for them.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  mdadm.8.in | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

A couple of comments below.

>  .TP
>  .B \-\-readonly
> -start the array readonly \(em not supported yet.
> +start the array with readonly status.

I believe it would be more correct to say "start the array in readonly 
mode" here.

>  .SH MANAGE MODE
>  .HP 12
> @@ -2438,12 +2439,11 @@ This will fully activate a partially assembled md array.
>
>  .TP
>  .B \-\-readonly
> -This will mark an active array as read-only, providing that it is
> -not currently being used.
> +This will set an active array as read-only status.

I think this chance is not good, and should be omitted.

>  .TP
>  .B \-\-readwrite
> -This will change a
> +This will change an
>  .B readonly
>  array back to being read/write.

This change is grammatically incorrect I believe and the original text 
is correct.

Thanks,
Jes


^ permalink raw reply

* Re: [PATCH] mdadm.c:fix compile warning "mdfd is uninitialized"
From: Jes Sorensen @ 2017-04-11 15:28 UTC (permalink / raw)
  To: Zhilong Liu; +Cc: linux-raid
In-Reply-To: <1491799792-6974-1-git-send-email-zlliu@suse.com>

On 04/10/2017 12:49 AM, Zhilong Liu wrote:
> Initialized the mdfd as -1 to prevent compile error
> of some compilers.
> For example, gcc version 4.8.5(SUSE Linux).
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
> ---
>  mdadm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Applied!

Thanks,
Jes



^ permalink raw reply

* Re: [PATCH 1/4] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Shaohua Li @ 2017-04-11 15:41 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, songliubraving
In-Reply-To: <029f973c-77be-fca9-952b-2d3921b5f6cd@intel.com>

On Tue, Apr 11, 2017 at 11:20:13AM +0200, Artur Paszkiewicz wrote:
> On 04/11/2017 10:28 AM, Artur Paszkiewicz wrote:
> > On 04/10/2017 09:09 PM, Shaohua Li wrote:
> >>> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
> >>> +{
> >>> +	struct kmem_cache *kc = pool_data;
> >>> +	struct ppl_io_unit *io;
> >>> +
> >>> +	io = kmem_cache_alloc(kc, gfp_mask);
> >>> +	if (!io)
> >>> +		return NULL;
> >>> +
> >>> +	io->header_page = alloc_page(gfp_mask);
> >>> +	if (!io->header_page) {
> >>> +		kmem_cache_free(kc, io);
> >>> +		return NULL;
> >>> +	}
> >>> +
> >>> +	return io;
> >>
> >> Maybe directly use GFP_NOWAIT here, we don't use other gfp
> > 
> > Do you mean ignore the gfp_mask parameter? I don't think we should do
> > that. It is provided by the mempool implementation. We only use
> > GFP_NOWAIT for explicit allocations, but GFP_KERNEL is used for
> > allocating initial items in mempool_create().
> > 
> >>> +}
> >>> +
> >>> +static void ppl_io_pool_free(void *element, void *pool_data)
> >>> +{
> >>> +	struct kmem_cache *kc = pool_data;
> >>> +	struct ppl_io_unit *io = element;
> >>> +
> >>> +	__free_page(io->header_page);
> >>> +	kmem_cache_free(kc, io);
> >>> +}
> >>> +
> >>>  static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> >>>  					  struct stripe_head *sh)
> >>>  {
> >>> @@ -202,18 +228,19 @@ static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> >>>  	struct ppl_io_unit *io;
> >>>  	struct ppl_header *pplhdr;
> >>>  
> >>> -	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> >>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
> >>>  	if (!io)
> >>>  		return NULL;
> >>>  
> >>> -	memset(io, 0, sizeof(*io));
> >>>  	io->log = log;
> >>> +	io->entries_count = 0;
> >>> +	io->pp_size = 0;
> >>> +	io->submitted = false;
> >>
> >> I'd suggest moving the memset to ppl_io_pool_alloc. Don't think we need to
> >> optimize to avoid setting header_page. And doing memset is less error prone,
> >> for example adding new fields.
> > 
> > OK, I'll change this and resend.
> 
> Well, on second thought, actually I can't move the memset to
> ppl_io_pool_alloc. Mempool can use the preallocated elements, which are
> then returned back to the pool after calling mempool_free() and can be
> reused later. So a ppl_io_unit must be initialized after retrieving it
> from mempool_alloc(). I can leave the memset in ppl_new_iounit() but
> io->header_page pointer will have to be temporarily copied before
> zeroing the iounit and then set back. Is this ok?

Yes, that's preferred.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 1/2] mdadm/manpage:update description for readonly in manpage
From: Wols Lists @ 2017-04-11 16:30 UTC (permalink / raw)
  To: Jes Sorensen, Zhilong Liu; +Cc: linux-raid
In-Reply-To: <6afe1397-3063-c5d7-58ba-f3dbdfa05336@gmail.com>

On 11/04/17 15:51, Jes Sorensen wrote:
>>
>>  .TP
>>  .B \-\-readonly
>> -This will mark an active array as read-only, providing that it is
>> -not currently being used.
>> +This will set an active array as read-only status.
> 
> I think this chance is not good, and should be omitted.

Agreed, except that I would change "providing" to "provided".
> 
>>  .TP
>>  .B \-\-readwrite
>> -This will change a
>> +This will change an
>>  .B readonly
>>  array back to being read/write.
> 
> This change is grammatically incorrect I believe and the original text
> is correct.

Agreed.

Cheers,
Wol

^ permalink raw reply

* Re: [md PATCH 00/10] Simplify bio splitting and related code.
From: Shaohua Li @ 2017-04-11 17:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <149136485390.25893.1797855041954158826.stgit@noble>

On Wed, Apr 05, 2017 at 02:05:50PM +1000, Neil Brown wrote:
> This is part of my little project to make bio splitting
> in Linux uniform and dead-lock free, in a way that will mean
> that we can get rid of all the bioset threads.
> 
> The basic approach is that when a bio needs to be split, we call
> bio_split(), bio_chain() and then generic_make_request().
> We then proceed to handle the remainder without further splitting.
> Recent changes to generic_make_request() ensure that this will
> be safe from deadlocks, providing each bioset is used only once
> in the stack.
> 
> This leads to simpler code in various places.  In particular, the
> splitting of bios that is needed to work around known bad blocks
> is now much less complex.  There is only ever one r1bio per bio.
> 
> As you can see from
>  10 files changed, 335 insertions(+), 540 deletions(-)
> there is a net reduction in code.

Looks good and makes code simpler, applied, thanks Neil! The patch 1 and 6 need
comments in the code to explain how deadlock is avoided though. Care to send a
new patch?

Thanks,
Shaohua

> 
> Thanks,
> NeilBrown
> 
> ---
> 
> NeilBrown (10):
>       md/raid1: simplify the splitting of requests.
>       md/raid1: simplify alloc_behind_master_bio()
>       Revert "block: introduce bio_copy_data_partial"
>       md/raid1: simplify handle_read_error().
>       md/raid1: factor out flush_bio_list()
>       md/raid10: simplify the splitting of requests.
>       md/raid10: simplify handle_read_error()
>       md/raid5: make chunk_aligned_read() split bios more cleanly.
>       md/linear: improve bio splitting.
>       md/raid0: fix up bio splitting.
> 
> 
>  block/bio.c         |   60 ++-------
>  drivers/md/linear.c |   75 +++++------
>  drivers/md/raid0.c  |   73 +++++------
>  drivers/md/raid1.c  |  346 ++++++++++++++++++++-------------------------------
>  drivers/md/raid1.h  |    2 
>  drivers/md/raid10.c |  282 ++++++++++++++----------------------------
>  drivers/md/raid10.h |    1 
>  drivers/md/raid5.c  |   33 +++--
>  drivers/md/raid5.h  |    1 
>  include/linux/bio.h |    2 
>  10 files changed, 335 insertions(+), 540 deletions(-)
> 
> --
> Signature
> 

^ permalink raw reply

* [PATCH v2] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Artur Paszkiewicz @ 2017-04-11 18:50 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, songliubraving, Artur Paszkiewicz
In-Reply-To: <20170411154124.nyvgqixjzjs2tzqd@kernel.org>

Allocate both struct ppl_io_unit and its header_page from a shared
mempool to avoid a possible deadlock. Implement allocate and free
functions for the mempool, remove the second pool for allocating
header_page. The header_pages are now freed with their io_units, not
when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
for allocating ppl_io_unit because we can handle failed allocations and
there is no reason to utilize emergency reserves.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 53 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 4eb0ebcf9c29..5d25bebf3328 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -102,7 +102,6 @@ struct ppl_conf {
 	struct kmem_cache *io_kc;
 	mempool_t *io_pool;
 	struct bio_set *bs;
-	mempool_t *meta_pool;
 
 	/* used only for recovery */
 	int recovered_entries;
@@ -197,25 +196,55 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
 	return tx;
 }
 
+static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	struct kmem_cache *kc = pool_data;
+	struct ppl_io_unit *io;
+
+	io = kmem_cache_alloc(kc, gfp_mask);
+	if (!io)
+		return NULL;
+
+	io->header_page = alloc_page(gfp_mask);
+	if (!io->header_page) {
+		kmem_cache_free(kc, io);
+		return NULL;
+	}
+
+	return io;
+}
+
+static void ppl_io_pool_free(void *element, void *pool_data)
+{
+	struct kmem_cache *kc = pool_data;
+	struct ppl_io_unit *io = element;
+
+	__free_page(io->header_page);
+	kmem_cache_free(kc, io);
+}
+
 static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
 					  struct stripe_head *sh)
 {
 	struct ppl_conf *ppl_conf = log->ppl_conf;
 	struct ppl_io_unit *io;
 	struct ppl_header *pplhdr;
+	struct page *header_page;
 
-	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
+	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
 	if (!io)
 		return NULL;
 
+	header_page = io->header_page;
 	memset(io, 0, sizeof(*io));
+	io->header_page = header_page;
+
 	io->log = log;
 	INIT_LIST_HEAD(&io->log_sibling);
 	INIT_LIST_HEAD(&io->stripe_list);
 	atomic_set(&io->pending_stripes, 0);
 	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
 
-	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
 	pplhdr = page_address(io->header_page);
 	clear_page(pplhdr);
 	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
@@ -371,8 +400,6 @@ static void ppl_log_endio(struct bio *bio)
 	if (bio->bi_error)
 		md_error(ppl_conf->mddev, log->rdev);
 
-	mempool_free(io->header_page, ppl_conf->meta_pool);
-
 	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
 		list_del_init(&sh->log_list);
 
@@ -1007,7 +1034,6 @@ static void __ppl_exit_log(struct ppl_conf *ppl_conf)
 
 	kfree(ppl_conf->child_logs);
 
-	mempool_destroy(ppl_conf->meta_pool);
 	if (ppl_conf->bs)
 		bioset_free(ppl_conf->bs);
 	mempool_destroy(ppl_conf->io_pool);
@@ -1113,25 +1139,20 @@ int ppl_init_log(struct r5conf *conf)
 
 	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
 	if (!ppl_conf->io_kc) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
-	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
+	ppl_conf->io_pool = mempool_create(conf->raid_disks, ppl_io_pool_alloc,
+					   ppl_io_pool_free, ppl_conf->io_kc);
 	if (!ppl_conf->io_pool) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
 	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
 	if (!ppl_conf->bs) {
-		ret = -EINVAL;
-		goto err;
-	}
-
-	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
-	if (!ppl_conf->meta_pool) {
-		ret = -EINVAL;
+		ret = -ENOMEM;
 		goto err;
 	}
 
-- 
2.11.0


^ permalink raw reply related

* Re:  Level change from 4 disk RAID5 to 4 disk RAID6
From: LM @ 2017-04-11 21:27 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <87vaqdxech.fsf@notabene.neil.brown.name>

On Mon, Apr 10, 2017 at 11:04:30AM +1000, NeilBrown wrote:
>On Sat, Apr 08 2017, LM wrote:
>
>> Hi,
>>
>> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
>> replace the 4 disks by 4 disks with a size of 2TB each.
>>
>> As far as I understand the man page, this can be achieved by replacing
>> the devices one after another and for each device rebuild the degraded
>> array with:
>>
>>    mdadm /dev/md0 --add /dev/sdX1
>>
>> Then the level change can be done together with growing the array:
>>
>>    mdadm --grow /dev/md0 --level=raid6 --backup-file=/root/backup-md0
>>
>> Does this work?
>>
>> I am asking if it works, because the man page also says:
>>
>>> mdadm --grow /dev/md4 --level=6 --backup-file=/root/backup-md4
>>>        The array /dev/md4 which is currently a RAID5 array will
>>>        be converted to RAID6.  There should normally already be
>>>        a spare drive attached to the array as a RAID6 needs one
>>>        more drive than a matching RAID5.
>>
>> And in my case only the size of disks is increased, not their number.
>>
>
>Yes, it probably works, and you probably don't need a backup file.
>Though you might need to explicitly tell mdadm to keep the number of
>devices unchanged by specifying "--raid-disk=4".
>
>You probably aren't very encouraged that I say "probably" and "might",
>and this is deliberate.
>
>I recommend that you crate 4 10Meg files, use losetup to create 10M
>devices, and build a RAID5 over them with --size=5M.
>Then try the --grow --level=6 command, and see what happens.
>If you mess up, you can easily start from scratch and try again.
>If it works, you can have some confidence that the same process will
>have the same result on real devices.
>
>NeilBrown

Thx, I tried what you suggested and found out it works like this:

* Grow the RAID5 to its maximum size. (mdadm will add a spare device which it
  later will refuse to remove if the array size is not reduced):
* Level change RAID5 -> RAID6 (will create a degraded 5 disk array,
  despite --raid-disk=4)
* Reduce the array size so the 5th disk can be removed
* Remove the 5th disk and normalize the layout



Here is the full log:


Create 4x 10M files:

# fallocate -l 10M A
# fallocate -l 10M B
# fallocate -l 10M C
# fallocate -l 10M D

Create 4x 10M devices:

# losetup /dev/loop10 A
# losetup /dev/loop11 B
# losetup /dev/loop12 C
# losetup /dev/loop13 D

Create a 4 disk RAID5 using 5M of each device:

# mdadm --create /dev/md/test --level=raid5 --size=5M --raid-devices=4 /dev/loop10 /dev/loop11 /dev/loop12 /dev/loop13
> mdadm: largest drive (/dev/loop10) exceeds size (5120K) by more than 1%
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md/test started.

Create a FS on the RAID:

# mkfs.ext4 -T small /dev/md/test
> mke2fs 1.43.3 (04-Sep-2016)
> Creating filesystem with 15360 1k blocks and 3840 inodes
> Filesystem UUID: 0d538322-2e07-463d-9f56-b9d22f5c9f8f
> Superblock backups stored on blocks:
>         8193
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (1024 blocks): done
> Writing superblocks and filesystem accounting information: done

Mount the RAID:

# mount /dev/md/test test/
# ls -al test/
> total 13
> drwxr-xr-x 3 root root  1024 10. Apr 22:50 .
> drwxrwxrwt 5 root root   240 10. Apr 22:49 ..
> drwx------ 2 root root 12288 10. Apr 22:50 lost+found

Store some file on the RAID to see if it survives:

# cd test/
# wget https://www.kernel.org/theme/images/logos/tux.png
> --2017-04-10 22:53:18--  https://www.kernel.org/theme/images/logos/tux.png
> Resolving www.kernel.org (www.kernel.org)... 147.75.205.195, 2604:1380:2000:f000::7
> Connecting to www.kernel.org (www.kernel.org)|147.75.205.195|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 8657 (8,5K) [image/png]
> Saving to: ‘tux.png’
>
> tux.png                       100%[================================================>]   8,45K  --.-KB/s    in 0,001s
>
> 2017-04-10 22:53:19 (6,21 MB/s) - ‘tux.png’ saved [8657/8657]

# feh test/tux.png
# cd ..
# umount test

Details about the RAID:

# mdadm --detail /dev/md/test
> /dev/md/test:
>         Version : 1.2
>   Creation Time : Mon Apr 10 22:50:39 2017
>      Raid Level : raid5
>      Array Size : 15360 (15.00 MiB 15.73 MB)
>   Used Dev Size : 5120 (5.00 MiB 5.24 MB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>     Update Time : Mon Apr 10 22:53:37 2017
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : lars-server:test  (local to host lars-server)
>            UUID : 49095ada:eadf4362:4f5386f5:c615e5bf
>          Events : 18
>
>     Number   Major   Minor   RaidDevice State
>        0       7       10        0      active sync   /dev/loop10
>        1       7       11        1      active sync   /dev/loop11
>        2       7       12        2      active sync   /dev/loop12
>        4       7       13        3      active sync   /dev/loop13

Grow the RAID5 to its maximum size. (mdadm will add a spare device which it
later will refuse to remove if the array size is not reduced):

# mdadm --grow /dev/md/test --size=7680
> mdadm: component size of /dev/md/test has been set to 7680K

See if tux is still alive:

# mount /dev/md/test test/
# feh test/tux.png
# umount test/

Change to level 6:

# mdadm --grow /dev/md/test --level=6 --raid-disk=4 --backup-file=/root/backup-md-test
> mdadm: Need 1 spare to avoid degraded array, and only have 0.
>        Use --force to over-ride this check.

Try to force it:

# mdadm --grow /dev/md/test --level=6 --raid-disk=4 --backup-file=/root/backup-md-test  --force
> mdadm: level of /dev/md/test changed to raid6
> mdadm: this change will reduce the size of the array.
>        use --grow --array-size first to truncate array.
>        e.g. mdadm --grow /dev/md/test --array-size 15360

Reduce the array size:

# mdadm --grow /dev/md/test --array-size 15360

See if tux is still alive:

# mount /dev/md/test test/
# feh test/tux.png
# umount test

Check the size:

# mdadm --detail /dev/md/test
> /dev/md/test:
>         Version : 1.2
>   Creation Time : Mon Apr 10 23:53:10 2017
>      Raid Level : raid6
>      Array Size : 15360 (15.00 MiB 15.73 MB)
>   Used Dev Size : 7680 (7.50 MiB 7.86 MB)
>    Raid Devices : 5
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>     Update Time : Mon Apr 10 23:57:05 2017
>           State : clean, degraded
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric-6
>      Chunk Size : 512K
>
>            Name : lars-server:test  (local to host lars-server)
>            UUID : 30ce9f41:03cd27d9:0f0317a8:e4117b5c
>          Events : 34
>
>     Number   Major   Minor   RaidDevice State
>        0       7       10        0      active sync   /dev/loop10
>        1       7       11        1      active sync   /dev/loop11
>        2       7       12        2      active sync   /dev/loop12
>        4       7       13        3      active sync   /dev/loop13
>        -       0        0        4      removed

Now remove the 5th spare disk mdadm added:

# mdadm --grow /dev/md/test --raid-disk=4 --layout=normalise --backup-file=/root/backup-md-test
> mdadm: Need to backup 3072K of critical section..

See if it worked:

# mdadm --detail /dev/md/test
> /dev/md/test:
>         Version : 1.2
>   Creation Time : Mon Apr 10 23:53:10 2017
>      Raid Level : raid6
>      Array Size : 15360 (15.00 MiB 15.73 MB)
>   Used Dev Size : 7680 (7.50 MiB 7.86 MB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>     Update Time : Mon Apr 10 23:57:58 2017
>           State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : lars-server:test  (local to host lars-server)
>            UUID : 30ce9f41:03cd27d9:0f0317a8:e4117b5c
>          Events : 46
>
>     Number   Major   Minor   RaidDevice State
>        0       7       10        0      active sync   /dev/loop10
>        1       7       11        1      active sync   /dev/loop11
>        2       7       12        2      active sync   /dev/loop12
>        4       7       13        3      active sync   /dev/loop13

And tux is still alive!

# mount /dev/md/test test/
# feh test/tux.png
# umount test

And the FS is clean, too!

# fsck /dev/md/test
fsck from util-linux 2.28.2
e2fsck 1.43.3 (04-Sep-2016)
/dev/md126: clean, 12/3840 files, 1775/15360 blocks

Clean-up the test setup:

# mdadm --stop /dev/md/test
# losetup -d /dev/loop10
# losetup -d /dev/loop11
# losetup -d /dev/loop12
# losetup -d /dev/loop13
# rm {A..D}

^ permalink raw reply

* Re:  Level change from 4 disk RAID5 to 4 disk RAID6
From: LM @ 2017-04-11 21:28 UTC (permalink / raw)
  To: Wols Lists, linux-raid
In-Reply-To: <58EB1AF4.5010005@youngman.org.uk>

On Mon, Apr 10, 2017 at 06:41:08AM +0100, Wols Lists wrote:
>On 08/04/17 22:42, LM wrote:
>> Hi,
>>
>> I have a 4 disk RAID5, the used dev size is 640.05 GB. Now I want to
>> replace the 4 disks by 4 disks with a size of 2TB each.
>>
>> As far as I understand the man page, this can be achieved by replacing
>> the devices one after another and for each device rebuild the degraded
>> array with:
>>
>>    mdadm /dev/md0 --add /dev/sdX1
>
>Do you have a spare SATA port or whatever your drives are. If so, then
>use the --replace option to mdadm, don't fail then add. You're risking a
>drive failure taking out your array - not a good move.
>
>And if you don't have a spare port, $20 for a PCI card or whatever is a
>good investment to keep your data safe.
>
>Have a look at the raid wiki - it tries to be a bit more verbose and
>easily comprehensible than the man page.
>
>Cheers,
>Wol

Thx for hinting me at --replace, I missed it. Yes, I have a spare SATA
port.

I successfully tested it using loop devices:

mdadm /dev/md/test --add-spare /dev/loop20 --replace /dev/loop10 --with /dev/loop20
mdadm /dev/md/test --add-spare /dev/loop21 --replace /dev/loop11 --with /dev/loop21
mdadm /dev/md/test --add-spare /dev/loop22 --replace /dev/loop12 --with /dev/loop22
mdadm /dev/md/test --add-spare /dev/loop23 --replace /dev/loop13 --with /dev/loop23
mdadm /dev/md/test --remove /dev/loop10 /dev/loop11 /dev/loop12 /dev/loop13
mdadm --grow /dev/md/test --size max
mdadm --grow /dev/md/test --level=6 --raid-disk=4 --backup-file=/root/backup-md-test --force
mdadm --grow /dev/md/test --array-size 407552
mdadm --grow /dev/md/test --raid-disk=4 --layout=normalise --backup-file=/root/backup-md-test


^ permalink raw reply

* Re: [PATCH v2] raid5-ppl: use a single mempool for ppl_io_unit and header_page
From: Shaohua Li @ 2017-04-11 21:58 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, songliubraving
In-Reply-To: <20170411185051.18780-1-artur.paszkiewicz@intel.com>

On Tue, Apr 11, 2017 at 08:50:51PM +0200, Artur Paszkiewicz wrote:
> Allocate both struct ppl_io_unit and its header_page from a shared
> mempool to avoid a possible deadlock. Implement allocate and free
> functions for the mempool, remove the second pool for allocating
> header_page. The header_pages are now freed with their io_units, not
> when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC
> for allocating ppl_io_unit because we can handle failed allocations and
> there is no reason to utilize emergency reserves.
> 
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

applied, thanks!
> ---
>  drivers/md/raid5-ppl.c | 53 +++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 37 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 4eb0ebcf9c29..5d25bebf3328 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -102,7 +102,6 @@ struct ppl_conf {
>  	struct kmem_cache *io_kc;
>  	mempool_t *io_pool;
>  	struct bio_set *bs;
> -	mempool_t *meta_pool;
>  
>  	/* used only for recovery */
>  	int recovered_entries;
> @@ -197,25 +196,55 @@ ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>  	return tx;
>  }
>  
> +static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
> +{
> +	struct kmem_cache *kc = pool_data;
> +	struct ppl_io_unit *io;
> +
> +	io = kmem_cache_alloc(kc, gfp_mask);
> +	if (!io)
> +		return NULL;
> +
> +	io->header_page = alloc_page(gfp_mask);
> +	if (!io->header_page) {
> +		kmem_cache_free(kc, io);
> +		return NULL;
> +	}
> +
> +	return io;
> +}
> +
> +static void ppl_io_pool_free(void *element, void *pool_data)
> +{
> +	struct kmem_cache *kc = pool_data;
> +	struct ppl_io_unit *io = element;
> +
> +	__free_page(io->header_page);
> +	kmem_cache_free(kc, io);
> +}
> +
>  static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>  					  struct stripe_head *sh)
>  {
>  	struct ppl_conf *ppl_conf = log->ppl_conf;
>  	struct ppl_io_unit *io;
>  	struct ppl_header *pplhdr;
> +	struct page *header_page;
>  
> -	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> +	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
>  	if (!io)
>  		return NULL;
>  
> +	header_page = io->header_page;
>  	memset(io, 0, sizeof(*io));
> +	io->header_page = header_page;
> +
>  	io->log = log;
>  	INIT_LIST_HEAD(&io->log_sibling);
>  	INIT_LIST_HEAD(&io->stripe_list);
>  	atomic_set(&io->pending_stripes, 0);
>  	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
>  
> -	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>  	pplhdr = page_address(io->header_page);
>  	clear_page(pplhdr);
>  	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> @@ -371,8 +400,6 @@ static void ppl_log_endio(struct bio *bio)
>  	if (bio->bi_error)
>  		md_error(ppl_conf->mddev, log->rdev);
>  
> -	mempool_free(io->header_page, ppl_conf->meta_pool);
> -
>  	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>  		list_del_init(&sh->log_list);
>  
> @@ -1007,7 +1034,6 @@ static void __ppl_exit_log(struct ppl_conf *ppl_conf)
>  
>  	kfree(ppl_conf->child_logs);
>  
> -	mempool_destroy(ppl_conf->meta_pool);
>  	if (ppl_conf->bs)
>  		bioset_free(ppl_conf->bs);
>  	mempool_destroy(ppl_conf->io_pool);
> @@ -1113,25 +1139,20 @@ int ppl_init_log(struct r5conf *conf)
>  
>  	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
>  	if (!ppl_conf->io_kc) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
> -	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
> +	ppl_conf->io_pool = mempool_create(conf->raid_disks, ppl_io_pool_alloc,
> +					   ppl_io_pool_free, ppl_conf->io_kc);
>  	if (!ppl_conf->io_pool) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
>  	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
>  	if (!ppl_conf->bs) {
> -		ret = -EINVAL;
> -		goto err;
> -	}
> -
> -	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
> -	if (!ppl_conf->meta_pool) {
> -		ret = -EINVAL;
> +		ret = -ENOMEM;
>  		goto err;
>  	}
>  
> -- 
> 2.11.0
> 

^ permalink raw reply

* Re: [md PATCH 00/10] Simplify bio splitting and related code.
From: NeilBrown @ 2017-04-11 23:27 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20170411170112.4txdyjdat63qpsi6@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1482 bytes --]

On Tue, Apr 11 2017, Shaohua Li wrote:

> On Wed, Apr 05, 2017 at 02:05:50PM +1000, Neil Brown wrote:
>> This is part of my little project to make bio splitting
>> in Linux uniform and dead-lock free, in a way that will mean
>> that we can get rid of all the bioset threads.
>> 
>> The basic approach is that when a bio needs to be split, we call
>> bio_split(), bio_chain() and then generic_make_request().
>> We then proceed to handle the remainder without further splitting.
>> Recent changes to generic_make_request() ensure that this will
>> be safe from deadlocks, providing each bioset is used only once
>> in the stack.
>> 
>> This leads to simpler code in various places.  In particular, the
>> splitting of bios that is needed to work around known bad blocks
>> is now much less complex.  There is only ever one r1bio per bio.
>> 
>> As you can see from
>>  10 files changed, 335 insertions(+), 540 deletions(-)
>> there is a net reduction in code.
>
> Looks good and makes code simpler, applied, thanks Neil! The patch 1 and 6 need
> comments in the code to explain how deadlock is avoided though. Care to send a
> new patch?

It isn't clear to me what sort of comment you want, or where it should
go.
It might make sense to have a comment near bio_split() explaining how to
use it (i.e. explaining the pattern used in various patches here), but
I don't see what sort of comments would help in raid1.c or raid10.c
??

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: md-cluster Oops 4.9.13
From: Guoqing Jiang @ 2017-04-12  1:32 UTC (permalink / raw)
  To: Marc Smith; +Cc: linux-raid
In-Reply-To: <CAHkw+Ld=yv=LoBKFUxNO6dEocOKzq-pFR=Tfk50OULXJ8NN9eQ@mail.gmail.com>



On 04/10/2017 09:25 PM, Marc Smith wrote:
> Hi,
>
> Sorry for the delay... I was hoping to cherry-pick this and test
> against 4.9.x, but it didn't apply cleanly, although it looks trivial
> to do it by hand. Is it recommended/okay to test this patch against
> 4.9.x? Will the fix eventually be merged into 4.9.x?

I think you can have a try with the patch then see what will happen, the 
better
way is try with the latest code though people don't like always update 
kernel,
but it is not a material for stable 4.9.x from my understanding.

Thanks,
Guoqing

>
>
> --Marc
>
> On Tue, Apr 4, 2017 at 11:01 PM, Guoqing Jiang <jgq516@gmail.com> wrote:
>>
>> On 04/04/2017 10:06 PM, Marc Smith wrote:
>>> Hi,
>>>
>>> I encountered an oops this morning when stopping a MD array
>>> (md-cluster)... there were 4 md-cluster array started, and they were
>>> in the middle of a rebuild. I stopped the first one and then stopped
>>> the second one immediately after and got the oops, here is a
>>> transcript of what was on my terminal session:
>>>
>>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array1
>>> mdadm: stopped /dev/md/array1
>>> [root@brimstone-1b ~]# mdadm --stop /dev/md/array2
>>>
>>> Message from syslogd@brimstone-1b at Tue Apr  4 09:54:40 2017 ...
>>> brimstone-1b kernel: [649162.174685] BUG: unable to handle kernel NULL
>>> pointer dereference at 0000000000000098
>>>
>>> Using Linux 4.9.13 and here is the output from the kernel messages:
>>>
>>> --snip--
>>> [649158.014731] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: leaving the
>>> lockspace group...
>>> [649158.015233] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8: group event
>>> done 0 0
>>> [649158.015303] dlm: 5b3b8f94-7875-b323-5bb8-29fa6866f4a8:
>>> release_lockspace final free
>>> [649158.015331] md: unbind<nvme0n1p1>
>>> [649158.042540] md: export_rdev(nvme0n1p1)
>>> [649158.042546] md: unbind<nvme1n1p1>
>>> [649158.048501] md: export_rdev(nvme1n1p1)
>>> [649161.759022] md127: detected capacity change from 1000068874240 to 0
>>> [649161.759025] md: md127 stopped.
>>> [649162.174685] BUG: unable to handle kernel NULL pointer dereference
>>> at 0000000000000098
>>> [649162.174727] IP: [<ffffffff81868b40>] recv_daemon+0x1e9/0x373
>>
>> Looks like the recv_daemon is still running after stop array, commit
>> 48df498 "md: move bitmap_destroy to the beginning of __md_stop"
>> ensure it won't happen.
>>
>>
>> [snip]
>>
>>> Perhaps this is already fixed in later versions? Let me know if you
>>> need any additional information.
>>
>> Could you pls try with the latest version? Please let me know if you
>> still see it, thanks.
>>
>> Regards,
>> Guoqing
>>


^ permalink raw reply

* [PATCH 1/2] lib/raid6: Build proper files on corresponding arch
From: Matt Brown @ 2017-04-12  1:35 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: linux-raid, dja

Previously the raid6 test Makefile did not correctly build the files for
testing on PowerPC. This patch fixes the bug, so that all appropriate files
for PowerPC are built.

Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
---
 lib/raid6/test/Makefile | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 9c333e9..62b26d1 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes)
         CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
 else
         HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
-                         gcc -c -x c - >&/dev/null && \
-                         rm ./-.o && echo yes)
+			 gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
         ifeq ($(HAS_ALTIVEC),yes)
-                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+		CFLAGS += -I../../../arch/powerpc/include
+		CFLAGS += -DCONFIG_ALTIVEC
+		OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+			vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
         endif
 endif
 ifeq ($(ARCH),tilegx)
-- 
2.9.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox