Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: George Rapp @ 2017-02-22  1:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Linux-RAID, Matthew Krumwiede, NeilBrown, Jes.Sorensen
In-Reply-To: <20170221175801.wt64t2tzcvg3sfmc@kernel.org>

> On Mon, Feb 20, 2017 at 05:18:46PM -0500, George Rapp wrote:
>> On Sat, Feb 11, 2017 at 7:32 PM, George Rapp <george.rapp@gmail.com> wrote:
>> [...snip...]
>>
>> When I try to assemble the RAID 5 array, though, the process gets
>> stuck at the location of the first bad block. The assemble command is:
>>
>> [...snip...]
>>
>> The md4_raid5 process immediately spikes to 100% CPU utilization, and
>> the reshape stops at 1901225472 KiB (which is exactly half of the
>> first bad sector value, 3802454640):
>>
> [...snip...]
On Tue, Feb 21, 2017 at 4:51 AM, Tomasz Majchrzak
<tomasz.majchrzak@intel.com> wrote:
> As long as you're sure the data on the disk is valid, I believe clearing
> bad block list manually in metadata (no easy way to do it) would allow
> reshape to complete.
>
> Tomek
On Tue, Feb 21, 2017 at 12:58 PM, Shaohua Li <shli@kernel.org> wrote:
>
> Add Neil and Jes.
>
> Yes, there were similar reports before. When reshape finds nadblocks, the
> reshape will do an infinite loop without any progress. I think there are two
> things we need to do:
>
> - Make reshape more robust. Maybe reshape should bail out if badblocks found.
> - Add an option in mdadm to force reset badblocks

OK, I examined the structure of the superblock and the badblocks
array. My first attempt was to zero out the bblog_offset and
bblog_size in the md superblock using dd (but that causes the checksum
to be different than the sb_csum in the superblock, and the mdadm
--assemble fails. I didn't want to research how to recalculate the
checksum unless I really, really have to.  8^)

Running mdadm under gdb, I determined that my bblog_offset was 72
sectors from the start of the md superblock), and filled that space
with 0xff characters in my overlay file:

# dd if=/dev/mapper/sdg4 bs=512 count=1 skip=73 of=ffffffff
# dd if=ffffffff of=/dev/mapper/sdg4 bs=512 count=1 seek=72

That convinced mdadm that I have a badblocks list, but it's empty:

# mdadm --examine-badblocks /dev/mapper/sdg4
Bad-blocks on /dev/mapper/sdg4:
#

Once I did that, and restarted the array with my overlay files:

# mdadm --assemble --force /dev/md4
--backup-file=/home/gwr/2017/2017-01/md4_backup__2017-01-25
/dev/mapper/sde4 /dev/mapper/sdf4 /dev/mapper/sdh4 /dev/mapper/sdl4
/dev/mapper/sdg4 /dev/mapper/sdk4 /dev/mapper/sdi4 /dev/mapper/sdj4
/dev/mapper/sdb4
mdadm: accepting backup with timestamp 1485366772 for array with
timestamp 1487645030
mdadm: /dev/md4 has been started with 9 drives (out of 10).
#

The reshape operation got past the two positions where it had frozen
earlier, and didn't throw any obvious errors to /var/log/messages, so
Tomek's suggestion seems to clear the badblocks seems to have worked.
However, this was in the overlay files, not the actual devices.

Before I proceed for real, does clearing the badblocks log and
assembling the array seem like my best option?

-- 
George Rapp  (Pataskala, OH) Home: george.rapp -- at -- gmail.com
LinkedIn profile: https://www.linkedin.com/in/georgerapp
Phone: +1 740 936 RAPP (740 936 7277)

^ permalink raw reply

* Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-02-22  6:12 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra,
	Milan Broz
In-Reply-To: <CAOtvUMcN8s886978vE6T=AE79KkZsLBnXsmhewnb5PB0UyAG3A@mail.gmail.com>

Hi Herbert,

On 8 February 2017 at 13:02, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan <binoy.jayan@linaro.org> wrote:
>> ===============================================================================
>> dm-crypt optimization for larger block sizes
>> ===============================================================================
>>
>> Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
>> is to move these algorithms from the dm layer to the kernel crypto layer by
>> implementing them as template ciphers so they can be used in relation with
>> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
>> patchset, the iv-generation code is moved from the dm layer to the crypto layer
>> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
>> at a time. Each bio contains the in memory representation of physically
>> contiguous disk blocks. Since the bio itself may not be contiguous in main
>> memory, the dm layer sets up a chained scatterlist of these blocks split into
>> physically contiguous segments in memory so that DMA can be performed.

> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef <gilad@benyossef.com>

I was wondering if this is near to be ready for submission (apart from
the testmgr.c
changes) or I need to make some changes to make it similar to the IPSec offload?

Thanks,
Binoy

^ permalink raw reply

* Re: [PATCH] md/linear: shutup lockdep warnning
From: Coly Li @ 2017-02-22  7:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <6ceac5d5f76023fe2b76dc053041c50f81775b12.1487714093.git.shli@fb.com>


> 在 2017年2月22日,上午5:55,Shaohua Li <shli@fb.com> 写道:
> 
> Commit 03a9e24(md linear: fix a race between linear_add() and
> linear_congested()) introduces the warnning.
> 
> Cc: Coly Li <colyli@suse.de>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks for the fix !

Coly



> ---
> drivers/md/linear.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index 789008b..5b06b0d 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -224,7 +224,8 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
>     * oldconf until no one uses it anymore.
>     */
>    mddev_suspend(mddev);
> -    oldconf = rcu_dereference(mddev->private);
> +    oldconf = rcu_dereference_protected(mddev->private,
> +            lockdep_is_held(&mddev->reconfig_mutex));
>    mddev->raid_disks++;
>    WARN_ONCE(mddev->raid_disks != newconf->raid_disks,
>        "copied raid_disks doesn't match mddev->raid_disks");
> -- 
> 2.9.3
> 


^ permalink raw reply

* Re: Device size for linux raid5 journal?
From: Christian Samsel @ 2017-02-22  8:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20170221180320.7fovqiv2yore3pog@kernel.org>

On Tuesday, 21 February 2017, 19:03:20 CET  Shaohua Li wrote:
> On Mon, Feb 20, 2017 at 05:50:09PM +0100, Christian Samsel wrote:
> > Hello raid team,
> > First of all, thanks for your work.
> > So i recently read about Linux raid5/raid6 write-cache and journaling and
> > thought about giving it a try. I'm mainly interested in the additional
> > safety provided by the journal but might want be future proof to use the
> > write cache as well.
> > I read how to create an array using a journal but i havent found the
> > slightest indication of how large the respective device/partition should
> > be. I went through the lwn article [1], the slides [2] of the respective
> > engineers at facebook and a few commit messages.
> > 
> > So my question is, let's assume i have a 6TB raid5 array (3x3TB), what
> > would a good journal device size be? I'd probably went with 4GB, as this
> > is kinda the upper bound of what hardware raid controller offer.
> Thanks for trying! Depending on write-through or write-back mode. For
> write-through mode, the size could be just several hundreds megabytes. For
> write-back mode, the size should be a little bigger, several gigabytes, but
> 4 GB should be enough.

Thanks four your response. I managed to get it running. I had a backup ready 
but as the restore would have taken a lot of time i applied the patch "[PATCH 
3/3] Add new journal to array that does not have journal" (4ddd650) to mdadm 
so i could add the journal without to have to recreate the array. That worked 
as well. I think adding the journal introduced a small write speed degregation 
but thats fine for me, as i mostly do reads and data safety is top priority.
I'm using write-through mode as of now. 

> Also I added doc about raid5-cache in
> kernel_source/Documentation/md/raid5-cache.txt recently.
I didnt saw that because i'm still running 4.9. That was helpful but the 
information regarding the size is very vague:

|In write-through mode, the cache disk isn't required to be big. Several
|hundreds megabytes are enough.
...
|Too small cache disk will make the write aggregation less efficient in this
|mode depending on the workloads. It's recommended to use a cache disk with at
|least several gigabytes size in write-back mode.

Some more information like on which factors is the optimal size dependent, 
maybe application type, system memory, array size, disk speed (i dont really 
know, just naming some candidates) and how these factors impact it, would be 
helpful. If it isnt that complex i'd be totally fine to just have it stated 
"The optimal size is X GB".

Besides that, i think a pointer to this documentation inside the mdadm manpage 
would also help. I suppose for a lot of user the documentation inside the 
kernel tree isnt really the first place to look.

Christian


^ permalink raw reply

* Re: [PATCH] md/raid1: handle flush request correctly
From: Coly Li @ 2017-02-22 11:43 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: NeilBrown
In-Reply-To: <45d118f3ebfef696ffc5a725db5ca775360a51b2.1487703748.git.shli@fb.com>

On 2017/2/22 上午3:03, Shaohua Li wrote:
> I got a warning triggered in align_to_barrier_unit_end. It's a flush
> request so sectors == 0. The flush request happens to work well without
> the new barrier patch, but we'd better handle it explictly.
> 
> Cc: Coly Li <colyli@suse.de>
> Cc: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly

> ---
>  drivers/md/raid1.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 954d028..e1ee446 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1282,8 +1282,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
>  	unsigned long flags;
>  	const int op = bio_op(bio);
>  	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> -	const unsigned long do_flush_fua = (bio->bi_opf &
> -						(REQ_PREFLUSH | REQ_FUA));
> +	const unsigned long do_fua = (bio->bi_opf & REQ_FUA);
>  	struct md_rdev *blocked_rdev;
>  	struct blk_plug_cb *cb;
>  	struct raid1_plug_cb *plug = NULL;
> @@ -1509,7 +1508,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
>  				   conf->mirrors[i].rdev->data_offset);
>  		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
>  		mbio->bi_end_io	= raid1_end_write_request;
> -		bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
> +		bio_set_op_attrs(mbio, op, do_fua | do_sync);
>  		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
>  		    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
>  		    conf->raid_disks - mddev->degraded > 1)
> @@ -1565,6 +1564,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>  	struct bio *split;
>  	sector_t sectors;
>  
> +	if (unlikely(bio->bi_opf & REQ_PREFLUSH)) {
> +		md_flush_request(mddev, bio);
> +		return;
> +	}
> +
>  	/* if bio exceeds barrier unit boundary, split it */
>  	sectors = align_to_barrier_unit_end(
>  			bio->bi_iter.bi_sector, bio_sectors(bio));
> 


^ permalink raw reply

* IMSM RAID10: Rebuild/Resync difference on Linux vs Windows
From: Matthias Dahl @ 2017-02-22 13:41 UTC (permalink / raw)
  To: linux-raid

Hello @everyone,

I had an unclean shutdown today and the RAID was (as expected) out of
sync -- so with the next boot, Linux started a resync.

A while into the resync, I had to fire up Windows due to work, and the
Intel Rapid Storage Manager took over the rebuild as expected. But there
was a crucial difference: When I left Linux, I was at somewhat ~35% with
another 2 hours or so to go for the resync, whereas IRSM was at 97% from
the get-go and took just another 10 minutes or so to (apparently) finish
the job.

On Linux, the resync was taking place at 150 MiB/s to 200 MiB/s. So even
if Windows was syncing faster (as-in: transfer-speed wise), there is no 
way to account for a jump from ~35% to 97%.

That is quite a huge difference and got me worried, since I ran into my 
fair share of bugs with imsm on Linux, unfortunately.

Is this to be expected and normal behavior? Does IRSM on Win use some 
kind of optimization/shortcut during rebuild that is not implemented on 
Linux? Or should this really not be happening and I should indeed be 
worried that the resync was done improperly now?

This happened with kernel 4.9.5 and mdadm/mdmon 4.0 with a RAID10.

If there you need any more information, please don't hesitate to ask and
I will gladly provide it and help figure this out.

Thanks in advance for any help.

So long,
Matthias

-- 
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
  Hire me for contract work: Open Source, Proprietary, Web/Mobile/Desktop

^ permalink raw reply

* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: Phil Turmel @ 2017-02-22 16:17 UTC (permalink / raw)
  To: George Rapp, Shaohua Li
  Cc: Linux-RAID, Matthew Krumwiede, NeilBrown, Jes.Sorensen
In-Reply-To: <CAF-KpgZ3tZQy93PwUFk0RZRfv1w0_WBRhU+FQ9C4=Hhh44H7KQ@mail.gmail.com>

On 02/21/2017 08:12 PM, George Rapp wrote:

> Before I proceed for real, does clearing the badblocks log and 
> assembling the array seem like my best option?

Yes.  And consider disabling badblocks on all of your arrays.

The badblocks feature does nothing but guarantee that errors
encountered on a device become uncorrectable, even if the cause
of the error was a transient communications or power problem, not
a true media flaw.  Bad block tracking at the OS level is
appropriate for ancient MFM and RLL devices that lack modern
firmware, or similar low-level devices.  And since the MD raid
bad block tracking feature does *not* provide redirects to
usable spare sectors, the feature is useless for such devices,
too.

MD raid currently does *nothing* when a badblock entry reduces
the redundancy available for a particular sector in an array.
The badblocks feature is incomplete, has no upside for modern
component devices, and should not be used.

Phil

^ permalink raw reply

* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: George Rapp @ 2017-02-22 18:39 UTC (permalink / raw)
  To: Phil Turmel, Shaohua Li
  Cc: Linux-RAID, Matthew Krumwiede, NeilBrown, Jes Sorensen,
	tomasz.majchrzak
In-Reply-To: <a9fc6148-decb-3494-75af-b9280412975e@turmel.org>

On Wed, Feb 22, 2017 at 11:17 AM Phil Turmel <philip@turmel.org> wrote:
>
> On 02/21/2017 08:12 PM, George Rapp wrote:
>
> > Before I proceed for real, does clearing the badblocks log and
> > assembling the array seem like my best option?
>
> Yes.  And consider disabling badblocks on all of your arrays.
>
> The badblocks feature does nothing but guarantee that errors
> encountered on a device become uncorrectable, even if the cause
> of the error was a transient communications or power problem, not
> a true media flaw.  Bad block tracking at the OS level is
> appropriate for ancient MFM and RLL devices that lack modern
> firmware, or similar low-level devices.  And since the MD raid
> bad block tracking feature does *not* provide redirects to
> usable spare sectors, the feature is useless for such devices,
> too.
>
> MD raid currently does *nothing* when a badblock entry reduces
> the redundancy available for a particular sector in an array.
> The badblocks feature is incomplete, has no upside for modern
> component devices, and should not be used.

OK, thanks. I removed the badblocks list as previously proposed and
reassembled the RAID 5 array, and the reshape completed on 9 drives.
I'm re-adding the tenth drive for redundancy before fsck'ing and
mounting the filesystem to see what we've done.  8^)

Current RAID status uploaded to
https://app.box.com/v/raid-status-2017-02-22-txt for them as are
interested.

Thanks to all for your input and help!

George

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-23  5:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <c04d68cd-96e8-dff9-c54e-54248394119f@suse.de>

On 2017/2/22 上午4:09, Coly Li wrote:
> On 2017/2/22 上午1:45, Shaohua Li wrote:
>> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
>>> On 2017/2/21 上午8:29, NeilBrown wrote:
>>>> On Mon, Feb 20 2017, Coly Li wrote:
>>>>
>>>>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
>>>>>>
>>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>>>>
>>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>>>>
>>>>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
>>>>>>>>>> going to change this as Shaohua wantsthe spitting to 
>>>>>>>>>> happen in a separate function, which I agree with, but
>>>>>>>>>> there is something else wrong here. Calling
>>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
>>>>>>>>>> It is OK for simple devices, but when one request can
>>>>>>>>>> wait for another request to the same device it can 
>>>>>>>>>> deadlock. This can happen with raid1.  If a resync
>>>>>>>>>> request calls raise_barrier() between one request and
>>>>>>>>>> the next, then the next has to wait for the resync
>>>>>>>>>> request, which has to wait for the first request. As
>>>>>>>>>> the first request will be stuck in the queue in 
>>>>>>>>>> generic_make_request(), you get a deadlock.
>>>>>>>>>
>>>>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>>>>> understand it as bio_list_on_stack in this function? And
>>>>>>>>> queue in underlying device, can I understand it as the
>>>>>>>>> data structures like plug->pending and 
>>>>>>>>> conf->pending_bio_list ?
>>>>>>>>
>>>>>>>> Yes, the queue in generic_make_request() is the
>>>>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>>>>> about.  I'm not referring to plug->pending or
>>>>>>>> conf->pending_bio_list at all.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I still don't get the point of deadlock, let me try to
>>>>>>>>> explain why I don't see the possible deadlock. If a bio
>>>>>>>>> is split, and the first part is processed by
>>>>>>>>> make_request_fn(), and then a resync comes and it will 
>>>>>>>>> raise a barrier, there are 3 possible conditions, - the
>>>>>>>>> resync I/O tries to raise barrier on same bucket of the
>>>>>>>>> first regular bio. Then the resync task has to wait to
>>>>>>>>> the first bio drops its conf->nr_pending[idx]
>>>>>>>>
>>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
>>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>>>>> happens immediately. Then the resync_task will increment
>>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
>>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
>>>>>>>> will have submitted bios to the underlying device, and they
>>>>>>>> will be in the bio_list_on_stack queue, and will not be
>>>>>>>> processed until raid1_make_request() completes.
>>>>>>>>
>>>>>>>> The loop in raid1_make_request() will then call
>>>>>>>> make_request_fn() which will call wait_barrier(), which
>>>>>>>> will wait for ->barrier[idx] to be zero.
>>>>>>>
>>>>>>> Thinking more carefully about this.. the 'idx' that the
>>>>>>> second bio will wait for will normally be different, so there
>>>>>>> won't be a deadlock after all.
>>>>>>>
>>>>>>> However it is possible for hash_long() to produce the same
>>>>>>> idx for two consecutive barrier_units so there is still the
>>>>>>> possibility of a deadlock, though it isn't as likely as I
>>>>>>> thought at first.
>>>>>>
>>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>>>>> original patch. Also fix a 'use-after-free' bug. For the
>>>>>> deadlock issue, I'll add below patch, please check.
>>>>>>
>>>>>> Thanks, Shaohua
>>>>>>
>>>>>
>>>
>>> Neil,
>>>
>>> Thanks for your patient explanation, I feel I come to follow up what
>>> you mean. Let me try to re-tell what I understand, correct me if I am
>>> wrong.
>>>
>>>
>>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
>>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
>>>>> happen, but once we have barrier bucket on it , there will no
>>>>> deadlock.
>>>>>
>>>>> My question is, this deadlock only happens when a big bio is
>>>>> split, and the split small bios are continuous, and the resync io
>>>>> visiting barrier buckets in sequntial order too. In the case if
>>>>> adjacent split regular bios or resync bios hit same barrier
>>>>> bucket, it will be a very big failure of hash design, and should
>>>>> have been found already. But no one complain it, so I don't
>>>>> convince myself tje deadlock is real with io barrier buckets
>>>>> (this is what Neil concerns).
>>>>
>>>> I think you are wrong about the design goal of a hash function. 
>>>> When feed a sequence of inputs, with any stride (i.e. with any
>>>> constant difference between consecutive inputs), the output of the
>>>> hash function should appear to be random. A random sequence can
>>>> produce the same number twice in a row. If the hash function
>>>> produces a number from 0 to N-1, you would expect two consecutive
>>>> outputs to be the same about once every N inputs.
>>>>
>>>
>>> Yes, you are right. But when I mentioned hash conflict, I limit the
>>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
>>> address divided by 64MB I/O barrier unit size, its value range is
>>> reduced to [0, 1<<38].
>>>
>>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
>>>
>>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
>>> bios at most.
>>>
>>> Then in this patch, the hash question is degraded to: for any
>>> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
>>> these 65 integers into range [0, 1023], will any hash conflict happen
>>> among these integers ?
>>>
>>> I tried a half range [0, 1<<37] to check hash conflict, by writing a
>>> simple code to emulate hash calculation in the new I/O barrier patch,
>>> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
>>> 1<<37] for hash conflict.
>>>
>>> On a 20 core CPU each run spent 7+ hours, finally I find no hash
>>> conflict detected up to 512 consecutive integers in above limited
>>> condition. For 1024, there are a lot hash conflict detected.
>>>
>>> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
>>> almost all existing md raid configuration. So for current kernel
>>> implementation and real world device, for a single bio, there is no
>>> possible hash conflict the new I/O barrier patch.
>>>
>>> If bi_iter.bi_size changes from unsigned int to unsigned long in
>>> future, the above assumption will be wrong. There will be hash
>>> conflict, and potential dead lock, which is quite implicit. Yes, I
>>> agree with you. No, bio split inside loop is not perfect.
>>>
>>>> Even if there was no possibility of a deadlock from a resync
>>>> request happening between two bios, there are other possibilities.
>>>>
>>>
>>> The bellowed text makes me know more about raid1 code, but confuses me
>>> more as well. Here comes my questions,
>>>
>>>> It is not, in general, safe to call mempool_alloc() twice in a
>>>> row, without first ensuring that the first allocation will get
>>>> freed by some other thread.  raid1_write_request() allocates from
>>>> r1bio_pool, and then submits bios to the underlying device, which
>>>> get queued on bio_list_on_stack.  They will not be processed until
>>>> after raid1_make_request() completes, so when raid1_make_request
>>>> loops around and calls raid1_write_request() again, it will try to
>>>> allocate another r1bio from r1bio_pool, and this might end up
>>>> waiting for the r1bio which is trapped and cannot complete.
>>>>
>>>
>>> Can I say that it is because blk_finish_plug() won't be called before
>>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
>>> will be added into plug->pending, but before blk_finish_plug() is
>>> called, they won't be handled.
>>
>> blk_finish_plug is called if raid1_make_request sleep. The bio is hold in
>> current->bio_list, not in plug list.
>>  
> 
> Oops, I messed them up,  thank you for the clarifying :-)
> 
>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>> crosses a boundary, then they will consume all 256 preallocated
>>>> entries, and want more. If there is no free memory, they will block
>>>> indefinitely.
>>>>
>>>
>>> If raid1_make_request() is modified into this way,
>>> +	if (bio_data_dir(split) == READ)
>>> +		raid1_read_request(mddev, split);
>>> +	else
>>> +		raid1_write_request(mddev, split);
>>> +	if (split != bio)
>>> +		generic_make_request(bio);
>>>
>>> Then the original bio will be added into the bio_list_on_stack of top
>>> level generic_make_request(), current->bio_list is initialized, when
>>> generic_make_request() is called nested in raid1_make_request(), the
>>> split bio will be added into current->bio_list and nothing else happens.
>>>
>>> After the nested generic_make_request() returns, the code back to next
>>> code of generic_make_request(),
>>> 2022                         ret = q->make_request_fn(q, bio);
>>> 2023
>>> 2024                         blk_queue_exit(q);
>>> 2025
>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>
>>> bio_list_pop() will return the second half of the split bio, and it is
>>
>> So in above sequence, the curent->bio_list will has bios in below sequence:
>> bios to underlaying disks, second half of original bio
>>
>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
>> second half of original bio.
>>
>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
>> array, handling the middle layer bio will make the 3rd layer bio hold to
>> bio_list again.
>>
> 
> Could you please give me more hint,
> - What is the meaning of "hold" from " make the 3rd layer bio hold to
> bio_list again" ?
> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?

I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
bucket size to 8MB, running for 10 hours, there is no deadlock observed,

Here is how the 4 layer stacked raid1 setup,
- There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
  /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
  /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
  /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
  /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
- Here is how the 4 layer stacked raid1 assembled, level 1 means the top
level, level 4 means the bottom level in the stacked devices,
  - level 1:
	/dev/md40: /dev/md30  /dev/md31
  - level 2:
	/dev/md30: /dev/md20  /dev/md21
	/dev/md31: /dev/md22  /dev/md23
  - level 3:
	/dev/md20: /dev/md10  /dev/md11
	/dev/md21: /dev/md12  /dev/md13
	/dev/md22: /dev/md14  /dev/md15
	/dev/md23: /dev/md16  /dev/md17
  - level 4:
	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4

Here is the fio job file,
[global]
direct=1
thread=1
ioengine=libaio

[job]
filename=/dev/md40
readwrite=write
numjobs=10
blocksize=33M
iodepth=128
time_based=1
runtime=10h

I planed to learn how the deadlock comes by analyze a deadlock
condition. Maybe it was because 8MB bucket unit size is small enough,
now I try to run with 512K bucket unit size, and see whether I can
encounter a deadlock.


=============== P.S ==============
When I run the stacked raid1 testing, I feel I see something behavior
suspiciously, it is resync.

The second time when I rebuild all the raid1 devices by "mdadm -C
/dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
/dev/md40 already accomplished 50%+ resync. I don't think it could be
that fast...

Coly



^ permalink raw reply

* Re: interesting case of a hung 'recovery'
From: Eyal Lebedinsky @ 2017-02-23  6:17 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org
In-Reply-To: <091a6513-1594-4a33-691a-cd1f7920d4a0@eyal.emu.id.au>

Can one of the experts on the list please comment on the issue.

Regards,
	Eyal

On 18/02/17 23:14, Eyal Lebedinsky wrote:
> I should start by saying that this is an old fedora 19 system
>
> Executive summary: after '--add'ing a new member a 'recovery' starts but 'sync_max' is not reset.
>
> $ uname -a
> Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ sudo mdadm --version
mdadm - v4.0 - 2017-01-09

> so the issue may have been fixed since.
>
> I had a disk fail in a raid6. After some 'pending' sectors were logged I decided to do a 'check'
> around that location (set sync_min/max and echo 'check'). Sure enough it elicited disk errors,
> but the disk did not recover and it was kicked out of the array. Moreover it became unresponsive.
> It needed a power cycle so I shutdown and rebooted the machine.
>
> Not one to give up easily I tried the check again, with the same result.
> It was time to '--remove' this array member. I then '--add'ed a new disk which started a recovery.
>
> A few hours later I noticed that it slowed down. A lot. It actually did not progress at all for
> a few hours (I was away from the machine).
>
> As I was staring at the screen (for a long while) I realised that it stopped at 55.5%, and this
> number is exactly where the original 'check' failed (I still do not understand why with my bad
> memory I remembered this number).
>
> I checked 'sync_completed' and it was proper.
> I then examined 'sync_max' and it was wrong - it had the location where the very early 'check'
> failed earlier in the day. It was the same sector where it is now paused at - looks related.
>
> I decided to take a (small) risk and do
>     # echo 'max' >/sys/block/md127/md/sync_max
> at which point the recovery moved on. It should be finished in about 5 hours.
>
> I do not think that it is correct for 'sync_max' to not be set to 'max' when a new member is
> added - it surely requires a full recovery.
>
> I really hope (and expect) that this was actually fixed, but this note may help others facing
> same predicament.
>
> cheers
>

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)

^ permalink raw reply

* RE:
From: Qin's Yanjun @ 2017-02-23 15:09 UTC (permalink / raw)



How are you today and your family? I require your attention and honest
co-operation about some issues which i will really want to discuss with you
which.  Looking forward to read from you soon.  

Qin's


______________________________

Sky Silk, http://aknet.kz


^ permalink raw reply

* Re: IMSM RAID10: Rebuild/Resync difference on Linux vs Windows
From: Artur Paszkiewicz @ 2017-02-23 15:27 UTC (permalink / raw)
  To: Matthias Dahl, linux-raid
In-Reply-To: <055fbb58-fb5f-5d69-0974-aaa234c99028@binary-island.eu>

On 02/22/2017 02:41 PM, Matthias Dahl wrote:
> Hello @everyone,
> 
> I had an unclean shutdown today and the RAID was (as expected) out of
> sync -- so with the next boot, Linux started a resync.
> 
> A while into the resync, I had to fire up Windows due to work, and the
> Intel Rapid Storage Manager took over the rebuild as expected. But there
> was a crucial difference: When I left Linux, I was at somewhat ~35% with
> another 2 hours or so to go for the resync, whereas IRSM was at 97% from
> the get-go and took just another 10 minutes or so to (apparently) finish
> the job.
> 
> On Linux, the resync was taking place at 150 MiB/s to 200 MiB/s. So even
> if Windows was syncing faster (as-in: transfer-speed wise), there is no way to account for a jump from ~35% to 97%.
> 
> That is quite a huge difference and got me worried, since I ran into my fair share of bugs with imsm on Linux, unfortunately.
> 
> Is this to be expected and normal behavior? Does IRSM on Win use some kind of optimization/shortcut during rebuild that is not implemented on Linux? Or should this really not be happening and I should indeed be worried that the resync was done improperly now?
> 
> This happened with kernel 4.9.5 and mdadm/mdmon 4.0 with a RAID10.
> 
> If there you need any more information, please don't hesitate to ask and
> I will gladly provide it and help figure this out.
> 
> Thanks in advance for any help.

Windows IRSM should not make any "shortcuts" in this case. If you
suspect that resync was not completed properly in Windows, you can run
it manually by writing "repair" to /sys/block/<dev>/md/sync_action. When
it finishes you can check md/mismatch_cnt if there were any
unsynchronized blocks.

Can you share the output of mdadm -E <raid disks> and mdadm --detail-platform?
What version of RST software are you using on Windows?

Thanks,
Artur

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-23 17:34 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <488f88e0-8111-a320-3abb-c6e6611a957e@suse.de>

On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> On 2017/2/22 上午4:09, Coly Li wrote:
> > On 2017/2/22 上午1:45, Shaohua Li wrote:
> >> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> >>> On 2017/2/21 上午8:29, NeilBrown wrote:
> >>>> On Mon, Feb 20 2017, Coly Li wrote:
> >>>>
> >>>>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
> >>>>>>
> >>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>>>>>
> >>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>>>>>
> >>>>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
> >>>>>>>>>> going to change this as Shaohua wantsthe spitting to 
> >>>>>>>>>> happen in a separate function, which I agree with, but
> >>>>>>>>>> there is something else wrong here. Calling
> >>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>>>>> It is OK for simple devices, but when one request can
> >>>>>>>>>> wait for another request to the same device it can 
> >>>>>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>>>>> request calls raise_barrier() between one request and
> >>>>>>>>>> the next, then the next has to wait for the resync
> >>>>>>>>>> request, which has to wait for the first request. As
> >>>>>>>>>> the first request will be stuck in the queue in 
> >>>>>>>>>> generic_make_request(), you get a deadlock.
> >>>>>>>>>
> >>>>>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>>>>> queue in underlying device, can I understand it as the
> >>>>>>>>> data structures like plug->pending and 
> >>>>>>>>> conf->pending_bio_list ?
> >>>>>>>>
> >>>>>>>> Yes, the queue in generic_make_request() is the
> >>>>>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>>>>> about.  I'm not referring to plug->pending or
> >>>>>>>> conf->pending_bio_list at all.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I still don't get the point of deadlock, let me try to
> >>>>>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>>>>> is split, and the first part is processed by
> >>>>>>>>> make_request_fn(), and then a resync comes and it will 
> >>>>>>>>> raise a barrier, there are 3 possible conditions, - the
> >>>>>>>>> resync I/O tries to raise barrier on same bucket of the
> >>>>>>>>> first regular bio. Then the resync task has to wait to
> >>>>>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>>>>>
> >>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>>>>> happens immediately. Then the resync_task will increment
> >>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>>>>> will have submitted bios to the underlying device, and they
> >>>>>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>>>>> processed until raid1_make_request() completes.
> >>>>>>>>
> >>>>>>>> The loop in raid1_make_request() will then call
> >>>>>>>> make_request_fn() which will call wait_barrier(), which
> >>>>>>>> will wait for ->barrier[idx] to be zero.
> >>>>>>>
> >>>>>>> Thinking more carefully about this.. the 'idx' that the
> >>>>>>> second bio will wait for will normally be different, so there
> >>>>>>> won't be a deadlock after all.
> >>>>>>>
> >>>>>>> However it is possible for hash_long() to produce the same
> >>>>>>> idx for two consecutive barrier_units so there is still the
> >>>>>>> possibility of a deadlock, though it isn't as likely as I
> >>>>>>> thought at first.
> >>>>>>
> >>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
> >>>>>> original patch. Also fix a 'use-after-free' bug. For the
> >>>>>> deadlock issue, I'll add below patch, please check.
> >>>>>>
> >>>>>> Thanks, Shaohua
> >>>>>>
> >>>>>
> >>>
> >>> Neil,
> >>>
> >>> Thanks for your patient explanation, I feel I come to follow up what
> >>> you mean. Let me try to re-tell what I understand, correct me if I am
> >>> wrong.
> >>>
> >>>
> >>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
> >>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
> >>>>> happen, but once we have barrier bucket on it , there will no
> >>>>> deadlock.
> >>>>>
> >>>>> My question is, this deadlock only happens when a big bio is
> >>>>> split, and the split small bios are continuous, and the resync io
> >>>>> visiting barrier buckets in sequntial order too. In the case if
> >>>>> adjacent split regular bios or resync bios hit same barrier
> >>>>> bucket, it will be a very big failure of hash design, and should
> >>>>> have been found already. But no one complain it, so I don't
> >>>>> convince myself tje deadlock is real with io barrier buckets
> >>>>> (this is what Neil concerns).
> >>>>
> >>>> I think you are wrong about the design goal of a hash function. 
> >>>> When feed a sequence of inputs, with any stride (i.e. with any
> >>>> constant difference between consecutive inputs), the output of the
> >>>> hash function should appear to be random. A random sequence can
> >>>> produce the same number twice in a row. If the hash function
> >>>> produces a number from 0 to N-1, you would expect two consecutive
> >>>> outputs to be the same about once every N inputs.
> >>>>
> >>>
> >>> Yes, you are right. But when I mentioned hash conflict, I limit the
> >>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
> >>> address divided by 64MB I/O barrier unit size, its value range is
> >>> reduced to [0, 1<<38].
> >>>
> >>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
> >>>
> >>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
> >>> bios at most.
> >>>
> >>> Then in this patch, the hash question is degraded to: for any
> >>> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
> >>> these 65 integers into range [0, 1023], will any hash conflict happen
> >>> among these integers ?
> >>>
> >>> I tried a half range [0, 1<<37] to check hash conflict, by writing a
> >>> simple code to emulate hash calculation in the new I/O barrier patch,
> >>> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
> >>> 1<<37] for hash conflict.
> >>>
> >>> On a 20 core CPU each run spent 7+ hours, finally I find no hash
> >>> conflict detected up to 512 consecutive integers in above limited
> >>> condition. For 1024, there are a lot hash conflict detected.
> >>>
> >>> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
> >>> almost all existing md raid configuration. So for current kernel
> >>> implementation and real world device, for a single bio, there is no
> >>> possible hash conflict the new I/O barrier patch.
> >>>
> >>> If bi_iter.bi_size changes from unsigned int to unsigned long in
> >>> future, the above assumption will be wrong. There will be hash
> >>> conflict, and potential dead lock, which is quite implicit. Yes, I
> >>> agree with you. No, bio split inside loop is not perfect.
> >>>
> >>>> Even if there was no possibility of a deadlock from a resync
> >>>> request happening between two bios, there are other possibilities.
> >>>>
> >>>
> >>> The bellowed text makes me know more about raid1 code, but confuses me
> >>> more as well. Here comes my questions,
> >>>
> >>>> It is not, in general, safe to call mempool_alloc() twice in a
> >>>> row, without first ensuring that the first allocation will get
> >>>> freed by some other thread.  raid1_write_request() allocates from
> >>>> r1bio_pool, and then submits bios to the underlying device, which
> >>>> get queued on bio_list_on_stack.  They will not be processed until
> >>>> after raid1_make_request() completes, so when raid1_make_request
> >>>> loops around and calls raid1_write_request() again, it will try to
> >>>> allocate another r1bio from r1bio_pool, and this might end up
> >>>> waiting for the r1bio which is trapped and cannot complete.
> >>>>
> >>>
> >>> Can I say that it is because blk_finish_plug() won't be called before
> >>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
> >>> will be added into plug->pending, but before blk_finish_plug() is
> >>> called, they won't be handled.
> >>
> >> blk_finish_plug is called if raid1_make_request sleep. The bio is hold in
> >> current->bio_list, not in plug list.
> >>  
> > 
> > Oops, I messed them up,  thank you for the clarifying :-)
> > 
> >>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> >>>> impossible.  If 256 threads all attempt a write (or read) that
> >>>> crosses a boundary, then they will consume all 256 preallocated
> >>>> entries, and want more. If there is no free memory, they will block
> >>>> indefinitely.
> >>>>
> >>>
> >>> If raid1_make_request() is modified into this way,
> >>> +	if (bio_data_dir(split) == READ)
> >>> +		raid1_read_request(mddev, split);
> >>> +	else
> >>> +		raid1_write_request(mddev, split);
> >>> +	if (split != bio)
> >>> +		generic_make_request(bio);
> >>>
> >>> Then the original bio will be added into the bio_list_on_stack of top
> >>> level generic_make_request(), current->bio_list is initialized, when
> >>> generic_make_request() is called nested in raid1_make_request(), the
> >>> split bio will be added into current->bio_list and nothing else happens.
> >>>
> >>> After the nested generic_make_request() returns, the code back to next
> >>> code of generic_make_request(),
> >>> 2022                         ret = q->make_request_fn(q, bio);
> >>> 2023
> >>> 2024                         blk_queue_exit(q);
> >>> 2025
> >>> 2026                         bio = bio_list_pop(current->bio_list);
> >>>
> >>> bio_list_pop() will return the second half of the split bio, and it is
> >>
> >> So in above sequence, the curent->bio_list will has bios in below sequence:
> >> bios to underlaying disks, second half of original bio
> >>
> >> bio_list_pop will pop bios to underlaying disks first, handle them, then the
> >> second half of original bio.
> >>
> >> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> >> array, handling the middle layer bio will make the 3rd layer bio hold to
> >> bio_list again.
> >>
> > 
> > Could you please give me more hint,
> > - What is the meaning of "hold" from " make the 3rd layer bio hold to
> > bio_list again" ?
> > - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
> 
> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
> 
> Here is how the 4 layer stacked raid1 setup,
> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
> level, level 4 means the bottom level in the stacked devices,
>   - level 1:
> 	/dev/md40: /dev/md30  /dev/md31
>   - level 2:
> 	/dev/md30: /dev/md20  /dev/md21
> 	/dev/md31: /dev/md22  /dev/md23
>   - level 3:
> 	/dev/md20: /dev/md10  /dev/md11
> 	/dev/md21: /dev/md12  /dev/md13
> 	/dev/md22: /dev/md14  /dev/md15
> 	/dev/md23: /dev/md16  /dev/md17
>   - level 4:
> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
> 
> Here is the fio job file,
> [global]
> direct=1
> thread=1
> ioengine=libaio
> 
> [job]
> filename=/dev/md40
> readwrite=write
> numjobs=10
> blocksize=33M
> iodepth=128
> time_based=1
> runtime=10h
> 
> I planed to learn how the deadlock comes by analyze a deadlock
> condition. Maybe it was because 8MB bucket unit size is small enough,
> now I try to run with 512K bucket unit size, and see whether I can
> encounter a deadlock.

Don't think raid1 could easily trigger the deadlock. Maybe you should try
raid10. The resync case is hard to trigger for raid1. The memory pressure case
is hard to trigger for both raid1/10. But it's possible to trigger.

The 3-layer case is something like this:
1. in level1, set current->bio_list, split bio to bio1 and bio2
2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
3. queue bio2 in current->bio_list
4. generic_make_request then pops bio1-level2
5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock

The problem is because we add new bio to current->bio_list tail.

> =============== P.S ==============
> When I run the stacked raid1 testing, I feel I see something behavior
> suspiciously, it is resync.
> 
> The second time when I rebuild all the raid1 devices by "mdadm -C
> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
> /dev/md40 already accomplished 50%+ resync. I don't think it could be
> that fast...

no idea, is this reproducible?

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-23 19:31 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20170223173418.qrgjpj6utgceg5f4@kernel.org>

On 2017/2/24 上午1:34, Shaohua Li wrote:
> On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
[snip]
>>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>>>> crosses a boundary, then they will consume all 256 preallocated
>>>>>> entries, and want more. If there is no free memory, they will block
>>>>>> indefinitely.
>>>>>>
>>>>>
>>>>> If raid1_make_request() is modified into this way,
>>>>> +	if (bio_data_dir(split) == READ)
>>>>> +		raid1_read_request(mddev, split);
>>>>> +	else
>>>>> +		raid1_write_request(mddev, split);
>>>>> +	if (split != bio)
>>>>> +		generic_make_request(bio);
>>>>>
>>>>> Then the original bio will be added into the bio_list_on_stack of top
>>>>> level generic_make_request(), current->bio_list is initialized, when
>>>>> generic_make_request() is called nested in raid1_make_request(), the
>>>>> split bio will be added into current->bio_list and nothing else happens.
>>>>>
>>>>> After the nested generic_make_request() returns, the code back to next
>>>>> code of generic_make_request(),
>>>>> 2022                         ret = q->make_request_fn(q, bio);
>>>>> 2023
>>>>> 2024                         blk_queue_exit(q);
>>>>> 2025
>>>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>>>
>>>>> bio_list_pop() will return the second half of the split bio, and it is
>>>>
>>>> So in above sequence, the curent->bio_list will has bios in below sequence:
>>>> bios to underlaying disks, second half of original bio
>>>>
>>>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
>>>> second half of original bio.
>>>>
>>>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
>>>> array, handling the middle layer bio will make the 3rd layer bio hold to
>>>> bio_list again.
>>>>
>>>
>>> Could you please give me more hint,
>>> - What is the meaning of "hold" from " make the 3rd layer bio hold to
>>> bio_list again" ?
>>> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
>>
>> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
>> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
>>
>> Here is how the 4 layer stacked raid1 setup,
>> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
>>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
>> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
>> level, level 4 means the bottom level in the stacked devices,
>>   - level 1:
>> 	/dev/md40: /dev/md30  /dev/md31
>>   - level 2:
>> 	/dev/md30: /dev/md20  /dev/md21
>> 	/dev/md31: /dev/md22  /dev/md23
>>   - level 3:
>> 	/dev/md20: /dev/md10  /dev/md11
>> 	/dev/md21: /dev/md12  /dev/md13
>> 	/dev/md22: /dev/md14  /dev/md15
>> 	/dev/md23: /dev/md16  /dev/md17
>>   - level 4:
>> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
>> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
>> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
>> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
>> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
>> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
>> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
>> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
>>
>> Here is the fio job file,
>> [global]
>> direct=1
>> thread=1
>> ioengine=libaio
>>
>> [job]
>> filename=/dev/md40
>> readwrite=write
>> numjobs=10
>> blocksize=33M
>> iodepth=128
>> time_based=1
>> runtime=10h
>>
>> I planed to learn how the deadlock comes by analyze a deadlock
>> condition. Maybe it was because 8MB bucket unit size is small enough,
>> now I try to run with 512K bucket unit size, and see whether I can
>> encounter a deadlock.
> 
> Don't think raid1 could easily trigger the deadlock. Maybe you should try
> raid10. The resync case is hard to trigger for raid1. The memory pressure case
> is hard to trigger for both raid1/10. But it's possible to trigger.
> 
> The 3-layer case is something like this:

Hi Shaohua,

I try to catch up with you, let me try to follow your mind by the
split-in-while-loop condition (this is my new I/O barrier patch). I
assume the original BIO is a write bio, and original bio is split and
handled in a while loop in raid1_make_request().

> 1. in level1, set current->bio_list, split bio to bio1 and bio2

This is done in level1 raid1_make_request().

> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list

Remap is done by raid1_write_request(), and bio1_level may be added into
one of the two list:
- plug->pending:
  bios in plug->pending may be handled in raid1_unplug(), or in
flush_pending_writes() of raid1d().
  If current task is about to be scheduled, raid1_unplug() will merge
plug->pending's bios to conf->pending_bio_list. And
conf->pending_bio_list will be handled in raid1d.
  If raid1_unplug() is triggered by blk_finish_plug(), it is also
handled in raid1d.

- conf->pending_bio_list:
  bios in this list is handled in raid1d by calling flush_pending_writes().


So generic_make_request() to handle bio1_level2 can only be called in
context of raid1d thread, bio1_level2 is added into raid1d's
bio_list_on_stack, not caller of level1 generic_make_request().

> 3. queue bio2 in current->bio_list

Same, bio2_level2 is in level1 raid1d's bio_list_on_stack.
Then back to level1 generic_make_request()

> 4. generic_make_request then pops bio1-level2

At this moment, bio1_level2 and bio2_level2 are in either plug->pending
or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
generic_make_request() returns to its caller.

If before bio_list_pop() called, kernel thread raid1d wakes up and
iterates conf->pending_bio_list in flush_pending_writes() or iterate
plug->pending in raid1_unplug() by blk_finish_plug(), that happens in
level1 raid1d's stack, bios will not show up in level1
generic_make_reques(), bio_list_pop() still returns NULL.

> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list

bio2_level2 is at head of conf->pending_bio_list or plug->pending, so
bio2_level2 is handled firstly.

level1 raid1 calls level2 generic_make_request(), then level2
raid1_make_request() is called, then level raid1_write_request().
bio2_level2 is remapped to bio2_level3, added into plug->pending (level1
raid1d's context) or conf->pending_bio_list (level2 raid1's conf), it
will be handled by level2 raid1d, when level2 raid1d wakes up.
Then returns back to level1 raid1, bio1_level2
is handled by level2 generic_make_request() and added into level2
plug->pending or conf->pending_bio_list. In this case neither
bio2_level2 nor bio1_level is added into any bio_list_on_stack.

Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
and sleeps.

Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
iterate level2 plug->pending or conf->pending_bio_list, and calling
level3 generic_make_request().

In level3 generic_make_request(), because it is level2 raid1d context,
not level1 raid1d context, bio2_level3 is send into
q->make_request_fn(), and finally added into level3 plug->pending or
conf->pending_bio_list, then back to level3 generic_make_reqeust().

Now level2 raid1d's current->bio_list is empty, so level3
generic_make_request() returns to level2 raid1d and continue to iterate
and send bio1_level3 into level3 generic_make_request().

After all bios are added into level3 plug->pending or
conf->pending_bio_list, level2 raid1d sleeps.

Now level3 raid1d wakes up, continue to iterate level3 plug->pending or
conf->pending_bio_list by calling generic_make_request() to underlying
devices (which might be a read device).

On the above whole patch, each lower level generic_make_request() is
called in context of the lower level raid1d. No recursive call happens
in normal code path.

In raid1 code, recursive call of generic_make_request() only happens for
READ bio, but if array is not frozen, no barrier is required, it doesn't
hurt.


> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock

As my understand to the code, it won't happen neither.

> 
> The problem is because we add new bio to current->bio_list tail.

New bios are added into other context's current->bio_list, which are
different lists. If what I understand is correct, a dead lock won't
happen in this way.

If my understanding is correct, suddenly I come to realize why raid1
bios are handled indirectly in another kernel thread.

(Just for your information, when I write to this location, another run
of testing finished, no deadlock. This time I reduce I/O barrier bucket
unit size to 512KB, and set blocksize to 33MB in fio job file. It is
really slow (130MB/s), but no deadlock observed)


The stacked raid1 devices are really really confused, if I am wrong, any
hint is warmly welcome.

> 
>> =============== P.S ==============
>> When I run the stacked raid1 testing, I feel I see something behavior
>> suspiciously, it is resync.
>>
>> The second time when I rebuild all the raid1 devices by "mdadm -C
>> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
>> /dev/md40 already accomplished 50%+ resync. I don't think it could be
>> that fast...
> 
> no idea, is this reproducible?

It can be stably reproduced. I need to check whether bitmap is cleaned
when create a stacked raid1. This is a little off topic in this thread,
once I have some idea, I will send out another topic. Hope it is just so
fast.

Coly

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-23 19:58 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <b6134a31-6cf6-3ea4-a3e5-0073c38c3a51@suse.de>

On Fri, Feb 24, 2017 at 03:31:16AM +0800, Coly Li wrote:
> On 2017/2/24 上午1:34, Shaohua Li wrote:
> > On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> [snip]
> >>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> >>>>>> impossible.  If 256 threads all attempt a write (or read) that
> >>>>>> crosses a boundary, then they will consume all 256 preallocated
> >>>>>> entries, and want more. If there is no free memory, they will block
> >>>>>> indefinitely.
> >>>>>>
> >>>>>
> >>>>> If raid1_make_request() is modified into this way,
> >>>>> +	if (bio_data_dir(split) == READ)
> >>>>> +		raid1_read_request(mddev, split);
> >>>>> +	else
> >>>>> +		raid1_write_request(mddev, split);
> >>>>> +	if (split != bio)
> >>>>> +		generic_make_request(bio);
> >>>>>
> >>>>> Then the original bio will be added into the bio_list_on_stack of top
> >>>>> level generic_make_request(), current->bio_list is initialized, when
> >>>>> generic_make_request() is called nested in raid1_make_request(), the
> >>>>> split bio will be added into current->bio_list and nothing else happens.
> >>>>>
> >>>>> After the nested generic_make_request() returns, the code back to next
> >>>>> code of generic_make_request(),
> >>>>> 2022                         ret = q->make_request_fn(q, bio);
> >>>>> 2023
> >>>>> 2024                         blk_queue_exit(q);
> >>>>> 2025
> >>>>> 2026                         bio = bio_list_pop(current->bio_list);
> >>>>>
> >>>>> bio_list_pop() will return the second half of the split bio, and it is
> >>>>
> >>>> So in above sequence, the curent->bio_list will has bios in below sequence:
> >>>> bios to underlaying disks, second half of original bio
> >>>>
> >>>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
> >>>> second half of original bio.
> >>>>
> >>>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> >>>> array, handling the middle layer bio will make the 3rd layer bio hold to
> >>>> bio_list again.
> >>>>
> >>>
> >>> Could you please give me more hint,
> >>> - What is the meaning of "hold" from " make the 3rd layer bio hold to
> >>> bio_list again" ?
> >>> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
> >>
> >> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> >> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
> >>
> >> Here is how the 4 layer stacked raid1 setup,
> >> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
> >>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
> >>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
> >>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
> >>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> >> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
> >> level, level 4 means the bottom level in the stacked devices,
> >>   - level 1:
> >> 	/dev/md40: /dev/md30  /dev/md31
> >>   - level 2:
> >> 	/dev/md30: /dev/md20  /dev/md21
> >> 	/dev/md31: /dev/md22  /dev/md23
> >>   - level 3:
> >> 	/dev/md20: /dev/md10  /dev/md11
> >> 	/dev/md21: /dev/md12  /dev/md13
> >> 	/dev/md22: /dev/md14  /dev/md15
> >> 	/dev/md23: /dev/md16  /dev/md17
> >>   - level 4:
> >> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
> >> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
> >> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
> >> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
> >> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
> >> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
> >> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
> >> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
> >>
> >> Here is the fio job file,
> >> [global]
> >> direct=1
> >> thread=1
> >> ioengine=libaio
> >>
> >> [job]
> >> filename=/dev/md40
> >> readwrite=write
> >> numjobs=10
> >> blocksize=33M
> >> iodepth=128
> >> time_based=1
> >> runtime=10h
> >>
> >> I planed to learn how the deadlock comes by analyze a deadlock
> >> condition. Maybe it was because 8MB bucket unit size is small enough,
> >> now I try to run with 512K bucket unit size, and see whether I can
> >> encounter a deadlock.
> > 
> > Don't think raid1 could easily trigger the deadlock. Maybe you should try
> > raid10. The resync case is hard to trigger for raid1. The memory pressure case
> > is hard to trigger for both raid1/10. But it's possible to trigger.
> > 
> > The 3-layer case is something like this:
> 
> Hi Shaohua,
> 
> I try to catch up with you, let me try to follow your mind by the
> split-in-while-loop condition (this is my new I/O barrier patch). I
> assume the original BIO is a write bio, and original bio is split and
> handled in a while loop in raid1_make_request().
> 
> > 1. in level1, set current->bio_list, split bio to bio1 and bio2
> 
> This is done in level1 raid1_make_request().
> 
> > 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
> 
> Remap is done by raid1_write_request(), and bio1_level may be added into
> one of the two list:
> - plug->pending:
>   bios in plug->pending may be handled in raid1_unplug(), or in
> flush_pending_writes() of raid1d().
>   If current task is about to be scheduled, raid1_unplug() will merge
> plug->pending's bios to conf->pending_bio_list. And
> conf->pending_bio_list will be handled in raid1d.
>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
> handled in raid1d.
> 
> - conf->pending_bio_list:
>   bios in this list is handled in raid1d by calling flush_pending_writes().
> 
> 
> So generic_make_request() to handle bio1_level2 can only be called in
> context of raid1d thread, bio1_level2 is added into raid1d's
> bio_list_on_stack, not caller of level1 generic_make_request().
> 
> > 3. queue bio2 in current->bio_list
> 
> Same, bio2_level2 is in level1 raid1d's bio_list_on_stack.
> Then back to level1 generic_make_request()
> 
> > 4. generic_make_request then pops bio1-level2
> 
> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
> or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
> generic_make_request() returns to its caller.
> 
> If before bio_list_pop() called, kernel thread raid1d wakes up and
> iterates conf->pending_bio_list in flush_pending_writes() or iterate
> plug->pending in raid1_unplug() by blk_finish_plug(), that happens in
> level1 raid1d's stack, bios will not show up in level1
> generic_make_reques(), bio_list_pop() still returns NULL.
> 
> > 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
> 
> bio2_level2 is at head of conf->pending_bio_list or plug->pending, so
> bio2_level2 is handled firstly.
> 
> level1 raid1 calls level2 generic_make_request(), then level2
> raid1_make_request() is called, then level raid1_write_request().
> bio2_level2 is remapped to bio2_level3, added into plug->pending (level1
> raid1d's context) or conf->pending_bio_list (level2 raid1's conf), it
> will be handled by level2 raid1d, when level2 raid1d wakes up.
> Then returns back to level1 raid1, bio1_level2
> is handled by level2 generic_make_request() and added into level2
> plug->pending or conf->pending_bio_list. In this case neither
> bio2_level2 nor bio1_level is added into any bio_list_on_stack.
> 
> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
> and sleeps.
> 
> Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
> iterate level2 plug->pending or conf->pending_bio_list, and calling
> level3 generic_make_request().
> 
> In level3 generic_make_request(), because it is level2 raid1d context,
> not level1 raid1d context, bio2_level3 is send into
> q->make_request_fn(), and finally added into level3 plug->pending or
> conf->pending_bio_list, then back to level3 generic_make_reqeust().
> 
> Now level2 raid1d's current->bio_list is empty, so level3
> generic_make_request() returns to level2 raid1d and continue to iterate
> and send bio1_level3 into level3 generic_make_request().
> 
> After all bios are added into level3 plug->pending or
> conf->pending_bio_list, level2 raid1d sleeps.
> 
> Now level3 raid1d wakes up, continue to iterate level3 plug->pending or
> conf->pending_bio_list by calling generic_make_request() to underlying
> devices (which might be a read device).
> 
> On the above whole patch, each lower level generic_make_request() is
> called in context of the lower level raid1d. No recursive call happens
> in normal code path.
> 
> In raid1 code, recursive call of generic_make_request() only happens for
> READ bio, but if array is not frozen, no barrier is required, it doesn't
> hurt.
> 
> 
> > 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
> 
> As my understand to the code, it won't happen neither.
> 
> > 
> > The problem is because we add new bio to current->bio_list tail.
> 
> New bios are added into other context's current->bio_list, which are
> different lists. If what I understand is correct, a dead lock won't
> happen in this way.
> 
> If my understanding is correct, suddenly I come to realize why raid1
> bios are handled indirectly in another kernel thread.
> 
> (Just for your information, when I write to this location, another run
> of testing finished, no deadlock. This time I reduce I/O barrier bucket
> unit size to 512KB, and set blocksize to 33MB in fio job file. It is
> really slow (130MB/s), but no deadlock observed)
> 
> 
> The stacked raid1 devices are really really confused, if I am wrong, any
> hint is warmly welcome.

Aha, you are correct. I missed we never directly dispatch bio in a schedule based
blk-plug flush. I'll drop the patch. Thanks for the insistence, good discussion!

Thanks,
Shaohua

^ permalink raw reply

* [PATCH 1/2] md/raid10: submit bio directly to replacement disk
From: Shaohua Li @ 2017-02-23 20:34 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb

Commit 57c67df(md/raid10: submit IO from originating thread instead of
md thread) submits bio directly for normal disks but not for replacement
disks. There is no point we shouldn't do this for replacement disks.

Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid10.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a1f8e98..4e85a61 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1477,11 +1477,24 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
 			mbio->bi_bdev = (void*)rdev;
 
 			atomic_inc(&r10_bio->remaining);
+
+			cb = blk_check_plugged(raid10_unplug, mddev,
+					       sizeof(*plug));
+			if (cb)
+				plug = container_of(cb, struct raid10_plug_cb,
+						    cb);
+			else
+				plug = NULL;
 			spin_lock_irqsave(&conf->device_lock, flags);
-			bio_list_add(&conf->pending_bio_list, mbio);
-			conf->pending_count++;
+			if (plug) {
+				bio_list_add(&plug->pending, mbio);
+				plug->pending_cnt++;
+			} else {
+				bio_list_add(&conf->pending_bio_list, mbio);
+				conf->pending_count++;
+			}
 			spin_unlock_irqrestore(&conf->device_lock, flags);
-			if (!mddev_check_plugged(mddev))
+			if (!plug)
 				md_wakeup_thread(mddev->thread);
 		}
 	}
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/2] md: delete dead code
From: Shaohua Li @ 2017-02-23 20:34 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb
In-Reply-To: <bf67a7498d5807e3913762f63402f367a6de401e.1487881989.git.shli@fb.com>

Nobody is using mddev_check_plugged(), so delete the dead code

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 8 --------
 drivers/md/md.h | 6 ------
 2 files changed, 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bef1863..b97f133 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -439,14 +439,6 @@ void md_flush_request(struct mddev *mddev, struct bio *bio)
 }
 EXPORT_SYMBOL(md_flush_request);
 
-void md_unplug(struct blk_plug_cb *cb, bool from_schedule)
-{
-	struct mddev *mddev = cb->data;
-	md_wakeup_thread(mddev->thread);
-	kfree(cb);
-}
-EXPORT_SYMBOL(md_unplug);
-
 static inline struct mddev *mddev_get(struct mddev *mddev)
 {
 	atomic_inc(&mddev->active);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 44fe227..8c4f462 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -674,16 +674,10 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
 
-extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule);
 extern void md_reload_sb(struct mddev *mddev, int raid_disk);
 extern void md_update_sb(struct mddev *mddev, int force);
 extern void md_kick_rdev_from_array(struct md_rdev * rdev);
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
-static inline int mddev_check_plugged(struct mddev *mddev)
-{
-	return !!blk_check_plugged(md_unplug, mddev,
-				   sizeof(struct blk_plug_cb));
-}
 
 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 {
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-02-23 23:14 UTC (permalink / raw)
  To: Coly Li, Shaohua Li
  Cc: linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <488f88e0-8111-a320-3abb-c6e6611a957e@suse.de>

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

On Thu, Feb 23 2017, Coly Li wrote:

>
> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> bucket size to 8MB, running for 10 hours, there is no deadlock observed,

Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS to 3
and make sure the write requests are larger than 1 page (and have resync
happen at the same time as writes).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] Monitor: triggers core dump when stat2devnm return NULL
From: zhilong @ 2017-02-24  3:05 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1487050315-27940-1-git-send-email-zlliu@suse.com>

Hi, Jes;

I just wanted to ping and see if this could get a review:


On 02/14/2017 01:31 PM, Zhilong Liu wrote:
>      ensure that the devnm should be a block device when uses
>      --wait parameter, such as the 'f' and 'd' type file would
>      be triggered core dumped.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Monitor.c b/Monitor.c
> index 802a9d9..1900db3 100644
> --- a/Monitor.c
> +++ b/Monitor.c
> @@ -1002,6 +1002,10 @@ int Wait(char *dev)
>   			strerror(errno));
>   		return 2;
>   	}
> +	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
> +		pr_err("%s is not a block device.\n", dev);
> +		return 2;
> +	}
>   	strcpy(devnm, stat2devnm(&stb));
>   
>   	while(1) {
> diff --git a/lib.c b/lib.c
> index b640634..7116298 100644
> --- a/lib.c
> +++ b/lib.c
> @@ -89,9 +89,6 @@ char *devid2kname(int devid)
>   
>   char *stat2kname(struct stat *st)
>   {
> -	if ((S_IFMT & st->st_mode) != S_IFBLK)
> -		return NULL;
> -
>   	return devid2kname(st->st_rdev);
>   }
>   


^ permalink raw reply

* Re: [PATCH] Monitor: triggers core dump when stat2devnm return NULL
From: zhilong @ 2017-02-24  3:06 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1487050315-27940-1-git-send-email-zlliu@suse.com>

Hi, Jes;
   I just wanted to ping and see if this could get a review:


On 02/14/2017 01:31 PM, Zhilong Liu wrote:
>      ensure that the devnm should be a block device when uses
>      --wait parameter, such as the 'f' and 'd' type file would
>      be triggered core dumped.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Monitor.c b/Monitor.c
> index 802a9d9..1900db3 100644
> --- a/Monitor.c
> +++ b/Monitor.c
> @@ -1002,6 +1002,10 @@ int Wait(char *dev)
>   			strerror(errno));
>   		return 2;
>   	}
> +	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
> +		pr_err("%s is not a block device.\n", dev);
> +		return 2;
> +	}
>   	strcpy(devnm, stat2devnm(&stb));
>   
>   	while(1) {
> diff --git a/lib.c b/lib.c
> index b640634..7116298 100644
> --- a/lib.c
> +++ b/lib.c
> @@ -89,9 +89,6 @@ char *devid2kname(int devid)
>   
>   char *stat2kname(struct stat *st)
>   {
> -	if ((S_IFMT & st->st_mode) != S_IFBLK)
> -		return NULL;
> -
>   	return devid2kname(st->st_rdev);
>   }
>   


^ permalink raw reply

* Re: [PATCH] Monitor: dev should be a block file in waitclean
From: zhilong @ 2017-02-24  3:07 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1487058625-28852-1-git-send-email-zlliu@suse.com>

Hi, Jes;
   I just wanted to ping and see if this could get a review:


On 02/14/2017 03:50 PM, Zhilong Liu wrote:
>      mdadm --wait-clean /dev/mdX, the dev should
>      be a block file, otherwise fd2devnm returns
>      NULL and triggers core dumped.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>
>
> diff --git a/Monitor.c b/Monitor.c
> index 802a9d9..d16ac49 100644
> --- a/Monitor.c
> +++ b/Monitor.c
> @@ -1060,7 +1060,17 @@ int WaitClean(char *dev, int sock, int verbose)
>   	struct mdinfo *mdi;
>   	int rv = 1;
>   	char devnm[32];
> +	struct stat stb;
>   
> +	if (stat(dev, &stb) != 0) {
> +		pr_err("Cannot find %s: %s\n", dev,
> +			strerror(errno));
> +		return 2;
> +	}
> +	if ((S_IFMT & stb.st_mode) != S_IFBLK) {
> +		pr_err("%s is not a block device.\n", dev);
> +		return 2;
> +	}
>   	fd = open(dev, O_RDONLY);
>   	if (fd < 0) {
>   		if (verbose)


^ permalink raw reply

* [PATCH 00/14] the latest changes for md-cluster
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Hi,

This patchset is based on md/for-next branch, which includes below parts:

1. some code improvements (0001-0004).
2. changes for use sync way to handle METADATA_UPDATED msg (0005-0009).
3. add resize support for md-cluster (0010-0012 and 0014), and also
   make some related changes inside md (0013).

Thanks,
Guoqing   

^ permalink raw reply

* [PATCH 01/14] md-cluster: remove unnecessary header files
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
In-Reply-To: <1487906124-20107-1-git-send-email-gqjiang@suse.com>

md-cluster.h is already included in md.h, so remove
the redundant one and we don't want to cross include
header file too.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 -
 drivers/md/md-cluster.h | 2 --
 2 files changed, 3 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 2b13117fb918..03c38ed222ce 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -16,7 +16,6 @@
 #include <linux/raid/md_p.h>
 #include "md.h"
 #include "bitmap.h"
-#include "md-cluster.h"
 
 #define LVB_SIZE	64
 #define NEW_DEV_TIMEOUT 5000
diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
index e765499ba591..8f26a5e80810 100644
--- a/drivers/md/md-cluster.h
+++ b/drivers/md/md-cluster.h
@@ -3,8 +3,6 @@
 #ifndef _MD_CLUSTER_H
 #define _MD_CLUSTER_H
 
-#include "md.h"
-
 struct mddev;
 struct md_rdev;
 
-- 
2.6.2


^ permalink raw reply related

* [PATCH 02/14] md-cluster: free md_cluster_info if node leave cluster
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
In-Reply-To: <1487906124-20107-1-git-send-email-gqjiang@suse.com>

To avoid memory leak, we need to free the cinfo which
is allocated when node join cluster.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 03c38ed222ce..84c001063983 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -973,6 +973,7 @@ static int leave(struct mddev *mddev)
 	lockres_free(cinfo->bitmap_lockres);
 	unlock_all_bitmaps(mddev);
 	dlm_release_lockspace(cinfo->lockspace, 2);
+	kfree(cinfo);
 	return 0;
 }
 
-- 
2.6.2


^ permalink raw reply related

* [PATCH 03/14] md-cluster: remove useless memset from gather_all_resync_info
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang
In-Reply-To: <1487906124-20107-1-git-send-email-gqjiang@suse.com>

This memset is not needed.  The lvb is already zeroed because
it was recently allocated by lockres_init, which uses kzalloc(),
and read_resync_info() doesn't need it to be zero anyway.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 84c001063983..620828e56dc8 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -776,7 +776,6 @@ static int gather_all_resync_info(struct mddev *mddev, int total_slots)
 		bm_lockres->flags |= DLM_LKF_NOQUEUE;
 		ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
 		if (ret == -EAGAIN) {
-			memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE);
 			s = read_resync_info(mddev, bm_lockres);
 			if (s) {
 				pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n",
-- 
2.6.2


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox