Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
From: Shaohua Li @ 2017-02-16 17:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148721994120.7521.3518782809103694227.stgit@noble>

On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
> We use md_write_start() to increase the count of pending writes, and
> md_write_end() to decrement the count.  We currently count bios
> submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
> has been attached to.
> 
> So now, raid5_make_request() calls md_write_start() and then
> md_write_end() to keep the count elevated during the setup of the
> request.
> 
> add_stripe_bio() calls md_write_start() for each stripe_head, and the
> completion routines always call md_write_end(), instead of only
> calling it when raid5_dec_bi_active_stripes() returns 0.
> make_discard_request also calls md_write_start/end().
> 
> The parallel between md_write_{start,end} and use of bi_phys_segments
> can be seen in that:
>  Whenever we set bi_phys_segments to 1, we now call md_write_start.
>  Whenever we increment it on non-read requests with
>    raid5_inc_bi_active_stripes(), we now call md_write_start().
>  Whenever we decrement bi_phys_segments on non-read requsts with
>     raid5_dec_bi_active_stripes(), we now call md_write_end().
> 
> This reduces our dependence on keeping a per-bio count of active
> stripes in bi_phys_segments.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid5-cache.c |    2 +-
>  drivers/md/raid5.c       |   27 +++++++++++++--------------
>  2 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 302dea3296ba..4b211f914d17 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
>  	while (wbi && wbi->bi_iter.bi_sector <
>  	       dev->sector + STRIPE_SECTORS) {
>  		wbi2 = r5_next_bio(wbi, dev->sector);
> +		md_write_end(conf->mddev);
>  		if (!raid5_dec_bi_active_stripes(wbi)) {
> -			md_write_end(conf->mddev);
>  			bio_list_add(return_bi, wbi);
>  		}
>  		wbi = wbi2;
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3c7e106c12a2..760b726943c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>  		bi->bi_next = *bip;
>  	*bip = bi;
>  	raid5_inc_bi_active_stripes(bi);
> +	md_write_start(conf->mddev, bi);
>  
>  	if (forwrite) {
>  		/* check if page is covered */
> @@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
>  
>  			bi->bi_error = -EIO;
> -			if (!raid5_dec_bi_active_stripes(bi)) {
> -				md_write_end(conf->mddev);
> +			md_write_end(conf->mddev);
> +			if (!raid5_dec_bi_active_stripes(bi))
>  				bio_list_add(return_bi, bi);
> -			}
>  			bi = nextbi;
>  		}
>  		if (bitmap_end)
> @@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
>  
>  			bi->bi_error = -EIO;
> -			if (!raid5_dec_bi_active_stripes(bi)) {
> -				md_write_end(conf->mddev);
> +			md_write_end(conf->mddev);
> +			if (!raid5_dec_bi_active_stripes(bi))
>  				bio_list_add(return_bi, bi);
> -			}
>  			bi = bi2;
>  		}
>  
> @@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>  				while (wbi && wbi->bi_iter.bi_sector <
>  					dev->sector + STRIPE_SECTORS) {
>  					wbi2 = r5_next_bio(wbi, dev->sector);
> -					if (!raid5_dec_bi_active_stripes(wbi)) {
> -						md_write_end(conf->mddev);
> +					md_write_end(conf->mddev);
> +					if (!raid5_dec_bi_active_stripes(wbi))
>  						bio_list_add(return_bi, wbi);
> -					}
>  					wbi = wbi2;
>  				}
>  				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> @@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>  
>  	bi->bi_next = NULL;
>  	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
> +	md_write_start(mddev, bi);

md_write_start calls wait_event, so it can't be called with lock hold. Maybe
add a special __md_write_start which only increases the count, we already wait
in make_request.

>  	stripe_sectors = conf->chunk_sectors *
>  		(conf->raid_disks - conf->max_degraded);
> @@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>  			sh->dev[d].towrite = bi;
>  			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
>  			raid5_inc_bi_active_stripes(bi);
> +			md_write_start(mddev, bi);

Ditto.

Thanks,
Shaohua

^ permalink raw reply

* Re: [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
From: Shaohua Li @ 2017-02-16 17:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148721994135.7521.12977218167692514945.stgit@noble>

On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
> If a device fails during a write, we must ensure the failure is
> recorded in the metadata before the completion of the write is
> acknowleged.
> 
> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
> write request returns.")  added code for this, but it was
> unnecessarily complicated.  We already had similar functionality for
> handling updates to the bad-block-list, thanks to Commit de393cdea66c
> ("md: make it easier to wait for bad blocks to be acknowledged.")
> 
> So revert most of the former commit, and instead avoid collecting
> completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
> the metadata and retry the stripe_head.
> 
> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
> asynchronously.  After analyse_stripe(), we have collected stable data
> about the state of devices, which will be used to make decisions.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid5.c |   27 ++++-----------------------
>  drivers/md/raid5.h |    3 ---
>  2 files changed, 4 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 760b726943c9..154593e0afbe 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
>  	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>  		goto finish;
>  
> -	if (s.handle_bad_blocks) {
> +	if (s.handle_bad_blocks ||
> +	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		goto finish;
>  	}

This part is fragile. We don't delay it and post it to handle list again. So
the raid5d/worker could run this stripe infinitely.

Thanks,
Shaohua

^ permalink raw reply

* Re: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
From: Shaohua Li @ 2017-02-16 20:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148721994322.7521.3431105903303667812.stgit@noble>

On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
> The 'writes_pending' counter is used to determine when the
> array is stable so that it can be marked in the superblock
> as "Clean".  Consequently it needs to be updated frequently
> but only checked for zero occasionally.  Recent changes to
> raid5 cause the count to be updated even more often - once
> per 4K rather than once per bio.  This provided
> justification for making the updates more efficient.
> 
> So we replace the atomic counter with a per-cpu array of
> 'long' counters. Incrementing and decrementing is normally
> much cheaper, testing for zero is more expensive.
> 
> To meaningfully be able to test for zero we need to be able
> to block further updates.  This is done by forcing the
> "increment" step to take a spinlock in the rare case that
> another thread is checking if the count is zero.  This is
> done using a new field: "checkers".  "checkers" is the
> number of threads that are currently checking whether the
> count is zero.  It is usually 0, occasionally 1, and it is
> not impossible that it could be higher, though this would be
> rare.
> 
> If, within an rcu_read_locked section, checkers is seen to
> be zero, then the local-cpu counter can be incremented
> freely.  If checkers is not zero, mddev->lock must be taken
> before the increment is allowed.  A decrement is always
> allowed.
> 
> To test for zero, a thread must increment "checkers", call
> synchronize_rcu(), then take mddev->lock.  Once this is done
> no new increments can happen.  A thread may choose to
> perform a quick test-for-zero by summing all the counters
> without holding a lock.  If this is non-zero, the the total
> count is non-zero, or was non-zero very recently, so it is
> safe to assume that it isn't zero.  If the quick check does
> report a zero sum, then it is worth performing the locking
> protocol.
> 
> When the counter is decremented, it is no longer possible to
> immediately test if the result is zero
> (atmic_dec_and_test()).  We don't even really want to
> perform the "quick" tests as that sums over all cpus and is
> work that will most often bring no benefit.
> 
> In the "safemode==2" case, when we want to mark the array as
> "clean" immediately when there are no writes, we perform the
> quick test anyway, and possibly wake the md thread to do the
> full test.  "safemode==2" is only used during shutdown so
> the cost is not problematic.
> 
> When safemode!=2 we always set the timer, rather than only
> when the counter reaches zero.
> 
> If mod_timer() is called to set the timeout to the value it
> already has, mod_timer() has low overhead with no atomic
> operations.  So at worst it will have a noticeable cost once
> per jiffie.  To further reduce the otherhead, we round the
> requests delay to a multiple of ->safemode_delay.  This
> might increase the delay until the timer fires a little, but
> will reduce the overhead of calling mod_timer()
> significantly.  If lots of requests are completing, the
> timer will be updated every 200 milliseconds (by default)
> and never fire.  When it does eventually fire, it will
> schedule the md thread to perform the full test for
> writes_pending==0, and this is quite likely to find '0'.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

This sounds like a good place to use percpu-refcount. In set_in_sync, we switch
it to atomic, read it, then switch it back to percpu.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 00/17] md: cleanup on direct access to bvec table
From: Shaohua Li @ 2017-02-16 22:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-raid, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <1487245547-24384-1-git-send-email-tom.leiming@gmail.com>

On Thu, Feb 16, 2017 at 07:45:30PM +0800, Ming Lei wrote:
> In MD's resync I/O path, there are lots of direct access to bio's
> bvec table. This patchset kills most of them, and the conversion
> is quite straightforward.

I don't like this approach. The MD uses a hacky way to manage pages allocated,
this is the root of the problem. The patches add another hack way to do the
management. I'd like to see explict management of the pages, for example, add
data structure in r1bio to manage the pages, then we can use existing API for
all the stuffes we need.

Thanks,
Shaohua

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Anthony Youngman @ 2017-02-17  1:24 UTC (permalink / raw)
  To: Reindl Harald, linux-raid
In-Reply-To: <07e8d394-2f95-caae-d971-a5fef009fefe@thelounge.net>

On 16/02/17 14:08, Reindl Harald wrote:
> Hi
>
> i am new and was redirected to this list from the bugtracker
> please have a look at https://bugzilla.kernel.org/show_bug.cgi?id=194551
>
> currently "writemostly" seems to be only supported on "real" RAID1 while
> i was in hope that by the conecpt of RAID10 having more or less
> RAID0+RAID1 it would also work on RAID10 (and on the virtual machine wre
> i tested mdadm with the flag before buying the disks it did not complain)

Be careful. Don't confuse Raid10 with Raid1+0. They are NOT the same 
thing (on linux at least), although they are very similar.

Cheers,
Wol

^ permalink raw reply

* Re: [PATCH 00/17] md: cleanup on direct access to bvec table
From: Ming Lei @ 2017-02-17  1:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, Linux Kernel Mailing List,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <20170216221642.sobqndndd7fbjoo7@kernel.org>

Hi Shaohua,

On Fri, Feb 17, 2017 at 6:16 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Feb 16, 2017 at 07:45:30PM +0800, Ming Lei wrote:
>> In MD's resync I/O path, there are lots of direct access to bio's
>> bvec table. This patchset kills most of them, and the conversion
>> is quite straightforward.
>
> I don't like this approach. The MD uses a hacky way to manage pages allocated,
> this is the root of the problem. The patches add another hack way to do the

Yes, I agree, and bio_iov_iter_get_pages() uses this kind of hacky way too
actually.

> management. I'd like to see explict management of the pages, for example, add
> data structure in r1bio to manage the pages, then we can use existing API for
> all the stuffes we need.

Yeah, that is definitely clean, but we have to pay the following cost:

- allocate at least N * (128 + 4) bytes per each r1_bio/r10_bio
- N is pool_info.raid_disks for raid1, and conf->copies for raid10

If we are happy to introduce the cost, I can take this way in V1.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [md PATCH 01/14] md/raid5: use md_write_start to count stripes, not bios
From: NeilBrown @ 2017-02-17  2:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170216172926.lojx2a7ceqifrklq@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5258 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
>> We use md_write_start() to increase the count of pending writes, and
>> md_write_end() to decrement the count.  We currently count bios
>> submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
>> has been attached to.
>> 
>> So now, raid5_make_request() calls md_write_start() and then
>> md_write_end() to keep the count elevated during the setup of the
>> request.
>> 
>> add_stripe_bio() calls md_write_start() for each stripe_head, and the
>> completion routines always call md_write_end(), instead of only
>> calling it when raid5_dec_bi_active_stripes() returns 0.
>> make_discard_request also calls md_write_start/end().
>> 
>> The parallel between md_write_{start,end} and use of bi_phys_segments
>> can be seen in that:
>>  Whenever we set bi_phys_segments to 1, we now call md_write_start.
>>  Whenever we increment it on non-read requests with
>>    raid5_inc_bi_active_stripes(), we now call md_write_start().
>>  Whenever we decrement bi_phys_segments on non-read requsts with
>>     raid5_dec_bi_active_stripes(), we now call md_write_end().
>> 
>> This reduces our dependence on keeping a per-bio count of active
>> stripes in bi_phys_segments.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid5-cache.c |    2 +-
>>  drivers/md/raid5.c       |   27 +++++++++++++--------------
>>  2 files changed, 14 insertions(+), 15 deletions(-)
>> 
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 302dea3296ba..4b211f914d17 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -265,8 +265,8 @@ r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
>>  	while (wbi && wbi->bi_iter.bi_sector <
>>  	       dev->sector + STRIPE_SECTORS) {
>>  		wbi2 = r5_next_bio(wbi, dev->sector);
>> +		md_write_end(conf->mddev);
>>  		if (!raid5_dec_bi_active_stripes(wbi)) {
>> -			md_write_end(conf->mddev);
>>  			bio_list_add(return_bi, wbi);
>>  		}
>>  		wbi = wbi2;
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3c7e106c12a2..760b726943c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -3075,6 +3075,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>>  		bi->bi_next = *bip;
>>  	*bip = bi;
>>  	raid5_inc_bi_active_stripes(bi);
>> +	md_write_start(conf->mddev, bi);
>>  
>>  	if (forwrite) {
>>  		/* check if page is covered */
>> @@ -3198,10 +3199,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
>>  
>>  			bi->bi_error = -EIO;
>> -			if (!raid5_dec_bi_active_stripes(bi)) {
>> -				md_write_end(conf->mddev);
>> +			md_write_end(conf->mddev);
>> +			if (!raid5_dec_bi_active_stripes(bi))
>>  				bio_list_add(return_bi, bi);
>> -			}
>>  			bi = nextbi;
>>  		}
>>  		if (bitmap_end)
>> @@ -3222,10 +3222,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
>>  			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
>>  
>>  			bi->bi_error = -EIO;
>> -			if (!raid5_dec_bi_active_stripes(bi)) {
>> -				md_write_end(conf->mddev);
>> +			md_write_end(conf->mddev);
>> +			if (!raid5_dec_bi_active_stripes(bi))
>>  				bio_list_add(return_bi, bi);
>> -			}
>>  			bi = bi2;
>>  		}
>>  
>> @@ -3582,10 +3581,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
>>  				while (wbi && wbi->bi_iter.bi_sector <
>>  					dev->sector + STRIPE_SECTORS) {
>>  					wbi2 = r5_next_bio(wbi, dev->sector);
>> -					if (!raid5_dec_bi_active_stripes(wbi)) {
>> -						md_write_end(conf->mddev);
>> +					md_write_end(conf->mddev);
>> +					if (!raid5_dec_bi_active_stripes(wbi))
>>  						bio_list_add(return_bi, wbi);
>> -					}
>>  					wbi = wbi2;
>>  				}
>>  				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
>> @@ -5268,6 +5266,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>>  
>>  	bi->bi_next = NULL;
>>  	bi->bi_phys_segments = 1; /* over-loaded to count active stripes */
>> +	md_write_start(mddev, bi);
>
> md_write_start calls wait_event, so it can't be called with lock hold. Maybe
> add a special __md_write_start which only increases the count, we already wait
> in make_request.

I cannot see what lock is held here, but I can see a that a lock is held
below while md_write_start() is called.

An "md_write_incr()" or similar would certainly make sense.  I'll add
that in.

Thanks,
NeilBrown


>
>>  	stripe_sectors = conf->chunk_sectors *
>>  		(conf->raid_disks - conf->max_degraded);
>> @@ -5314,6 +5313,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>>  			sh->dev[d].towrite = bi;
>>  			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
>>  			raid5_inc_bi_active_stripes(bi);
>> +			md_write_start(mddev, bi);
>
> Ditto.
>
> Thanks,
> Shaohua
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [md PATCH 02/14] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-02-17  2:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170216173757.chtxbvpbzxcqjk62@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2134 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:01PM +1100, Neil Brown wrote:
>> If a device fails during a write, we must ensure the failure is
>> recorded in the metadata before the completion of the write is
>> acknowleged.
>> 
>> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
>> write request returns.")  added code for this, but it was
>> unnecessarily complicated.  We already had similar functionality for
>> handling updates to the bad-block-list, thanks to Commit de393cdea66c
>> ("md: make it easier to wait for bad blocks to be acknowledged.")
>> 
>> So revert most of the former commit, and instead avoid collecting
>> completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
>> the metadata and retry the stripe_head.
>> 
>> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
>> asynchronously.  After analyse_stripe(), we have collected stable data
>> about the state of devices, which will be used to make decisions.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid5.c |   27 ++++-----------------------
>>  drivers/md/raid5.h |    3 ---
>>  2 files changed, 4 insertions(+), 26 deletions(-)
>> 
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 760b726943c9..154593e0afbe 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -4492,7 +4492,8 @@ static void handle_stripe(struct stripe_head *sh)
>>  	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>>  		goto finish;
>>  
>> -	if (s.handle_bad_blocks) {
>> +	if (s.handle_bad_blocks ||
>> +	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>>  		set_bit(STRIPE_HANDLE, &sh->state);
>>  		goto finish;
>>  	}
>
> This part is fragile. We don't delay it and post it to handle list again. So
> the raid5d/worker could run this stripe infinitely.
>

Ah yes - I forget about the worker threads.

I think we need to get raid5_do_work() to block while MD_CHANGE_PENDING
is set.
I'll look over the interactions there and see if that really make sense.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [md PATCH 09/14] md/raid10: stop using bi_phys_segments
From: NeilBrown @ 2017-02-17  2:15 UTC (permalink / raw)
  To: Jack Wang; +Cc: Shaohua Li, linux-raid, Christoph Hellwig
In-Reply-To: <CA+res+SSS4c1nXo8AE14ku2q_3Bb+w-+XuZOK14SHLACWkHzRw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3404 bytes --]

On Thu, Feb 16 2017, Jack Wang wrote:

> 2017-02-16 5:39 GMT+01:00 NeilBrown <neilb@suse.com>:
>> raid10 currently repurposes bi_phys_segments on each
>> incoming bio to count how many r10bio was used to encode the
>> request.
>>
>> We need to know when the number of attached r10bio reaches
>> zero to:
>> 1/ call bio_endio() when all IO on the bio is finished
>> 2/ decrement ->nr_pending so that resync IO can proceed.
>>
>> Now that the bio has its own __bi_remaining counter, that
>> can be used instead. We can call bio_inc_remaining to
>> increment the counter and call bio_endio() every time an
>> r10bio completes, rather than only when bi_phys_segments
>> reaches zero.
>>
>> This addresses point 1, but not point 2.  bio_endio()
>> doesn't (and cannot) report when the last r10bio has
>> finished, so a different approach is needed.
>>
>> So: instead of counting bios in ->nr_pending, count r10bios.
>> i.e. every time we attach a bio, increment nr_pending.
>> Every time an r10bio completes, decrement nr_pending.
>>
>> Normally we only increment nr_pending after first checking
>> that ->barrier is zero, or some other non-trivial tests and
>> possible waiting.  When attaching multiple r10bios to a bio,
>> we only need the tests and the waiting once.  After the
>> first increment, subsequent increments can happen
>> unconditionally as they are really all part of the one
>> request.
>>
>> So introduce inc_pending() which can be used when we know
>> that nr_pending is already elevated.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/raid10.c |   76 +++++++++++++++++----------------------------------
>>  1 file changed, 25 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 9258cbe233bb..6b4d8643c574 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -301,27 +301,18 @@ static void reschedule_retry(struct r10bio *r10_bio)
>>  static void raid_end_bio_io(struct r10bio *r10_bio)
>>  {
>>         struct bio *bio = r10_bio->master_bio;
>> -       int done;
>>         struct r10conf *conf = r10_bio->mddev->private;
>>
>> -       if (bio->bi_phys_segments) {
>> -               unsigned long flags;
>> -               spin_lock_irqsave(&conf->device_lock, flags);
>> -               bio->bi_phys_segments--;
>> -               done = (bio->bi_phys_segments == 0);
>> -               spin_unlock_irqrestore(&conf->device_lock, flags);
>> -       } else
>> -               done = 1;
>>         if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>>                 bio->bi_error = -EIO;
>> -       if (done) {
>> -               bio_endio(bio);
>> -               /*
>> -                * Wake up any possible resync thread that waits for the device
>> -                * to go idle.
>> -                */
>> -               allow_barrier(conf);
>> -       }
>> +
>> +       /*
>> +        * Wake up any possible resync thread that waits for the device
>> +        * to go idle.
>> +        */
>> +       allow_barrier(conf);
>> +       bio_endio(bio);
>> +
>
> Hi Neil,
>
> Why do you switch the order of above 2 lines, is there a reason
> behind, I notice in raid1 you kept the order?

I cannot think why I would have done that.  It doesn't matter what
order they are in, but it does make sense to leave the order unchanged
unless there is a good reason.  I'll put it back.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [md PATCH 14/14] MD: use per-cpu counter for writes_pending
From: NeilBrown @ 2017-02-17  2:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170216201211.3q7rvvp2p3xxz6ew@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 4034 bytes --]

On Thu, Feb 16 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
>> The 'writes_pending' counter is used to determine when the
>> array is stable so that it can be marked in the superblock
>> as "Clean".  Consequently it needs to be updated frequently
>> but only checked for zero occasionally.  Recent changes to
>> raid5 cause the count to be updated even more often - once
>> per 4K rather than once per bio.  This provided
>> justification for making the updates more efficient.
>> 
>> So we replace the atomic counter with a per-cpu array of
>> 'long' counters. Incrementing and decrementing is normally
>> much cheaper, testing for zero is more expensive.
>> 
>> To meaningfully be able to test for zero we need to be able
>> to block further updates.  This is done by forcing the
>> "increment" step to take a spinlock in the rare case that
>> another thread is checking if the count is zero.  This is
>> done using a new field: "checkers".  "checkers" is the
>> number of threads that are currently checking whether the
>> count is zero.  It is usually 0, occasionally 1, and it is
>> not impossible that it could be higher, though this would be
>> rare.
>> 
>> If, within an rcu_read_locked section, checkers is seen to
>> be zero, then the local-cpu counter can be incremented
>> freely.  If checkers is not zero, mddev->lock must be taken
>> before the increment is allowed.  A decrement is always
>> allowed.
>> 
>> To test for zero, a thread must increment "checkers", call
>> synchronize_rcu(), then take mddev->lock.  Once this is done
>> no new increments can happen.  A thread may choose to
>> perform a quick test-for-zero by summing all the counters
>> without holding a lock.  If this is non-zero, the the total
>> count is non-zero, or was non-zero very recently, so it is
>> safe to assume that it isn't zero.  If the quick check does
>> report a zero sum, then it is worth performing the locking
>> protocol.
>> 
>> When the counter is decremented, it is no longer possible to
>> immediately test if the result is zero
>> (atmic_dec_and_test()).  We don't even really want to
>> perform the "quick" tests as that sums over all cpus and is
>> work that will most often bring no benefit.
>> 
>> In the "safemode==2" case, when we want to mark the array as
>> "clean" immediately when there are no writes, we perform the
>> quick test anyway, and possibly wake the md thread to do the
>> full test.  "safemode==2" is only used during shutdown so
>> the cost is not problematic.
>> 
>> When safemode!=2 we always set the timer, rather than only
>> when the counter reaches zero.
>> 
>> If mod_timer() is called to set the timeout to the value it
>> already has, mod_timer() has low overhead with no atomic
>> operations.  So at worst it will have a noticeable cost once
>> per jiffie.  To further reduce the otherhead, we round the
>> requests delay to a multiple of ->safemode_delay.  This
>> might increase the delay until the timer fires a little, but
>> will reduce the overhead of calling mod_timer()
>> significantly.  If lots of requests are completing, the
>> timer will be updated every 200 milliseconds (by default)
>> and never fire.  When it does eventually fire, it will
>> schedule the md thread to perform the full test for
>> writes_pending==0, and this is quite likely to find '0'.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> This sounds like a good place to use percpu-refcount. In set_in_sync, we switch
> it to atomic, read it, then switch it back to percpu.

I did have a quick look at percpu-refcount and it didn't seem suitable.
There is no interface to set the counter to atomic and wait for that to
complete.
There is also no interface to perform a lockless sum - only expected to
report '0' if the count has been zero for a little while.

I might be able to make do without those, or possibly enhance
percpu-refcount.

I'll have a look - thanks.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 00/17] md: cleanup on direct access to bvec table
From: Shaohua Li @ 2017-02-17  4:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List,
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT, linux-block,
	Christoph Hellwig, NeilBrown
In-Reply-To: <CACVXFVPRuikOn=zJ5fboGMyQs3rMxaUK4f-Lk0s-+HpPy+ov8w@mail.gmail.com>

On Fri, Feb 17, 2017 at 09:25:27AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Fri, Feb 17, 2017 at 6:16 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Feb 16, 2017 at 07:45:30PM +0800, Ming Lei wrote:
> >> In MD's resync I/O path, there are lots of direct access to bio's
> >> bvec table. This patchset kills most of them, and the conversion
> >> is quite straightforward.
> >
> > I don't like this approach. The MD uses a hacky way to manage pages allocated,
> > this is the root of the problem. The patches add another hack way to do the
> 
> Yes, I agree, and bio_iov_iter_get_pages() uses this kind of hacky way too
> actually.
> 
> > management. I'd like to see explict management of the pages, for example, add
> > data structure in r1bio to manage the pages, then we can use existing API for
> > all the stuffes we need.
> 
> Yeah, that is definitely clean, but we have to pay the following cost:
> 
> - allocate at least N * (128 + 4) bytes per each r1_bio/r10_bio
> - N is pool_info.raid_disks for raid1, and conf->copies for raid10
> 
> If we are happy to introduce the cost, I can take this way in V1.

It's not a big deal. The inflight bio shouldn't be big, so the r1_bio count
isn't big. We don't waste much.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-17  6:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <87shnevcpr.fsf@notabene.neil.brown.name>

On 2017/2/16 下午3:04, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:

[snip]
>> - I/O barrier bucket is indexed by bio start sector If multiple
>> I/O requests hit different I/O barrier units, they only need to
>> compete I/O barrier with other I/Os which hit the same I/O
>> barrier bucket index with each other. The index of a barrier
>> bucket which a bio should look for is calculated by
>> sector_to_idx() which is defined in raid1.h as an inline
>> function, static inline int sector_to_idx(sector_t sector) { 
>> return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS, 
>> BARRIER_BUCKETS_NR_BITS); } Here sector_nr is the start sector
>> number of a bio.
> 
> "hash_long() is used so that sequential writes in are region of
> the array which is not being resynced will not consistently align
> with the buckets that are being sequentially resynced, as described
> below"

Sorry, this sentence is too long to be understood by me ... Could you
please to use some simple and short sentence ?


[snip]
>> There are two changes should be noticed, - In raid1d(), I change
>> the code to decrease conf->nr_pending[idx] into single loop, it
>> looks like this, spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags); This change
>> generates more spin lock operations, but in next patch of this
>> patch set, it will be replaced by a single line code, 
>> atomic_dec(&conf->nr_queueud[idx]); So we don't need to worry
>> about spin lock cost here. - Mainline raid1 code split original
>> raid1_make_request() into raid1_read_request() and
>> raid1_write_request(). If the original bio goes across an I/O
>> barrier unit size, this bio will be split before calling
>> raid1_read_request() or raid1_write_request(),  this change the
>> code logic more simple and clear. - In this patch wait_barrier()
>> is moved from raid1_make_request() to raid1_write_request(). In
>> raid_read_request(), original wait_barrier() is replaced by
>> raid1_read_request(). The differnece is wait_read_barrier() only
>> waits if array is frozen, using different barrier function in
>> different code path makes the code more clean and easy to read.
> 
> Thank you for putting the effort into writing a comprehensve
> change description.  I really appreciate it.

Neil, thank you so much. I fix all the above typos in next version,
once I understand that long sentence I will include it in the patch
comments too.


>> 
>> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct
>> mddev *mddev, struct bio *bio,
>> 
>> static void raid1_make_request(struct mddev *mddev, struct bio
>> *bio) { -	struct r1conf *conf = mddev->private; -	struct r1bio
>> *r1_bio; +	void (*make_request_fn)(struct mddev *mddev, struct
>> bio *bio); +	struct bio *split; +	sector_t sectors;
>> 
>> -	/* -	 * make_request() can abort the operation when read-ahead
>> is being -	 * used and no empty request is available. -	 * -	 */ 
>> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO); +
>> make_request_fn = (bio_data_dir(bio) == READ) ? +
>> raid1_read_request : raid1_write_request;
>> 
>> -	r1_bio->master_bio = bio; -	r1_bio->sectors =
>> bio_sectors(bio); -	r1_bio->state = 0; -	r1_bio->mddev = mddev; -
>> r1_bio->sector = bio->bi_iter.bi_sector; - -	/* -	 * We might
>> need to issue multiple reads to different devices if there -	 *
>> are bad blocks around, so we keep track of the number of reads
>> in -	 * bio->bi_phys_segments.  If this is 0, there is only one
>> r1_bio and -	 * no locking will be needed when requests complete.
>> If it is -	 * non-zero, then it is the number of not-completed
>> requests. -	 */ -	bio->bi_phys_segments = 0; -
>> bio_clear_flag(bio, BIO_SEG_VALID); +	/* if bio exceeds barrier
>> unit boundary, split it */ +	do { +		sectors =
>> align_to_barrier_unit_end( +				bio->bi_iter.bi_sector,
>> bio_sectors(bio)); +		if (sectors < bio_sectors(bio)) { +			split
>> = bio_split(bio, sectors, GFP_NOIO, fs_bio_set); +
>> bio_chain(split, bio); +		} else { +			split = bio; +		}
>> 
>> -	if (bio_data_dir(bio) == READ) -		raid1_read_request(mddev,
>> bio, r1_bio); -	else -		raid1_write_request(mddev, bio, r1_bio); 
>> +		make_request_fn(mddev, split); +	} while (split != bio); }
> 
> I know you are going to change this as Shaohua wantsthe spitting to
> happen in a separate function, which I agree with, but there is 
> something else wrong here. Calling bio_split/bio_chain repeatedly
> in a loop is dangerous. It is OK for simple devices, but when one
> request can wait for another request to the same device it can
> deadlock. This can happen with raid1.  If a resync request calls
> raise_barrier() between one request and the next, then the next has
> to wait for the resync request, which has to wait for the first
> request. As the first request will be stuck in the queue in 
> generic_make_request(), you get a deadlock.

For md raid1, queue in generic_make_request(), can I understand it as
bio_list_on_stack in this function? And queue in underlying device,
can I understand it as the data structures like plug->pending and
conf->pending_bio_list ?

I still don't get the point of deadlock, let me try to explain why I
don't see the possible deadlock. If a bio is split, and the first part
is processed by make_request_fn(), and then a resync comes and it will
raise a barrier, there are 3 possible conditions,
- the resync I/O tries to raise barrier on same bucket of the first
regular bio. Then the resync task has to wait to the first bio drops
its conf->nr_pending[idx]
- the resync I/O hits a barrier bucket not related to the first split
regular I/O or the second split regular I/O, no one will be blocked.
- the resync I/O hits the same barrier bucket which the second split
regular bio will access, then the barrier is raised and second split
bio will be blocked, but the first split regular I/O will continue to
go ahead.

The first and the second split bios won't hit same I/O barrier bucket,
and a single resync bio may only be interfered with one of the split
regular I/O. This is why I don't see how the deadlock comes.


And one more question is, even there is only one I/O barrier bucket, I
don't understand how the first split regular I/O will be stuck in the
queue in generic_make_request(), I assume the queue is
bio_list_on_stack. If the first split bio is stuck in
generic_make_request() I can understand there is a deadlock, but I
don't see why it may be blocked there. Could you please to give me
more hint ?


> It is much safer to:
> 
> if (need to split) { split = bio_split(bio, ...) bio_chain(...) 
> make_request_fn(split); generic_make_request(bio); } else 
> make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in
> 'split') which will queue some requests to the underlying devices.
> These requests will be queued in generic_make_request. Then we
> queue the remainder of the bio, which will be added to the end of
> the generic_make_request queue. Then we return. 
> generic_make_request() will pop the lower-level device requests off
> the queue and handle them first.  Then it will process the
> remainder of the original bio once the first section has been fully
> processed.

Do you mean before the first split bio being fully processed, the
second split bio should not be sent to the underlying device ? If this
is what I understand, it seems the split bio won't be merged into
large request in underlying device, it might be a tiny performance lost ?

Coly



^ permalink raw reply

* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2017-02-17  7:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <87r32yvcoz.fsf@notabene.neil.brown.name>

On 2017/2/16 下午3:04, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:
> 
>> @@ -2393,6 +2455,11 @@ static void handle_write_finished(struct
>> r1conf *conf, struct r1bio *r1_bio) idx =
>> sector_to_idx(r1_bio->sector); conf->nr_queued[idx]++; 
>> spin_unlock_irq(&conf->device_lock); +		/* +		 * In case
>> freeze_array() is waiting for condition +		 *
>> get_unqueued_pending() == extra to be true. +		 */ +
>> wake_up(&conf->wait_barrier); 
>> md_wakeup_thread(conf->mddev->thread); } else { if
>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2529,9 +2596,7
>> @@ static void raid1d(struct md_thread *thread) retry_list); 
>> list_del(&r1_bio->retry_list); idx =
>> sector_to_idx(r1_bio->sector); -
>> spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; -
>> spin_unlock_irqrestore(&conf->device_lock, flags);
> 
> Why do you think it is safe to decrement nr_queued without holding
> the lock? Surely this could race with handle_write_finished, and an
> update could be lost.

conf->nr_queued[idx] is an integer and aligned to 4 bytes address, so
conf->nr_queued[idx]++ is same to atomic_inc(&conf->nr_queued[idx]),
it is atomic operation. And there is no ordering requirement, so I
don't need memory barrier here. This is why I remove spin lock, and
change it from atomic_t back to int.

IMHO, the problematic location is not here, but in freeze_array(). Now
the code assume array is froze when "get_unqueued_pending(conf) ==
extra" gets true. I think it is incorrect.

After conf->array_frozen is set to 1, raid1 code may still handle the
on flying requests, so conf->nr_pending[] and conf->nr_queued[] may
both decreasing. There is possibility that get_unqueued_pending()
returns 0 before everything is quiet at a very shot moment. If the
wait_event inside freeze_array() just catches this moment and gets a
true condition, continue to go and back to its caller, there will be
things unexpected happen.

I don't cover this issue in this patch set because I feel this is
another topic. Hmm, maybe I am a little off topic here.

Coly Li

^ permalink raw reply

* Re: [PATCH 12/17] md: raid1: avoid direct access to bvec table in process_checks()
From: kbuild test robot @ 2017-02-17  8:33 UTC (permalink / raw)
  Cc: kbuild-all, Shaohua Li, Jens Axboe, linux-kernel, linux-raid,
	linux-block, Christoph Hellwig, NeilBrown, Ming Lei
In-Reply-To: <1487245547-24384-13-git-send-email-tom.leiming@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6268 bytes --]

Hi Ming,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc8]
[cannot apply to next-20170216]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ming-Lei/md-cleanup-on-direct-access-to-bvec-table/20170216-210357
config: powerpc-cell_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from arch/powerpc/include/asm/page.h:331:0,
                    from arch/powerpc/include/asm/thread_info.h:34,
                    from include/linux/thread_info.h:25,
                    from include/asm-generic/preempt.h:4,
                    from ./arch/powerpc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:59,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/md/raid1.c:34:
   drivers/md/raid1.c: In function 'raid1d':
>> include/asm-generic/memory_model.h:54:52: warning: 'sbio_pages$' may be used uninitialized in this function [-Wmaybe-uninitialized]
    #define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
                                                       ^
   drivers/md/raid1.c:2008:42: note: 'sbio_pages$' was declared here
     struct page *pbio_pages[RESYNC_PAGES], *sbio_pages[RESYNC_PAGES];
                                             ^~~~~~~~~~
   drivers/md/raid1.c:2075:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
        if (memcmp(page_address(p),
            ^~~~~~~~~~~~~~~~~~~~~~~
            page_address(s),
            ~~~~~~~~~~~~~~~~
            page_len[j]))
            ~~~~~~~~~~~~
   drivers/md/raid1.c:2007:6: note: 'page_len$' was declared here
     int page_len[RESYNC_PAGES];
         ^~~~~~~~
   In file included from arch/powerpc/include/asm/page.h:331:0,
                    from arch/powerpc/include/asm/thread_info.h:34,
                    from include/linux/thread_info.h:25,
                    from include/asm-generic/preempt.h:4,
                    from ./arch/powerpc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:59,
                    from include/linux/spinlock.h:50,
                    from include/linux/mmzone.h:7,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:14,
                    from drivers/md/raid1.c:34:
>> include/asm-generic/memory_model.h:54:52: warning: 'pbio_pages$' may be used uninitialized in this function [-Wmaybe-uninitialized]
    #define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
                                                       ^
   drivers/md/raid1.c:2008:15: note: 'pbio_pages$' was declared here
     struct page *pbio_pages[RESYNC_PAGES], *sbio_pages[RESYNC_PAGES];
                  ^~~~~~~~~~
   drivers/md/raid1.c:1978:8: warning: 'pages$' may be used uninitialized in this function [-Wmaybe-uninitialized]
       if (r1_sync_page_io(rdev, sect, s,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             pages[idx],
             ~~~~~~~~~~~
             READ) != 0)
             ~~~~~
   drivers/md/raid1.c:1872:15: note: 'pages$' was declared here
     struct page *pages[RESYNC_PAGES];
                  ^~~~~

vim +54 include/asm-generic/memory_model.h

a117e66e KAMEZAWA Hiroyuki  2006-03-27  38  ({	unsigned long __pfn = (pfn);		\
c5d71243 Rafael J. Wysocki  2008-11-08  39  	unsigned long __nid = arch_pfn_to_nid(__pfn);  \
a117e66e KAMEZAWA Hiroyuki  2006-03-27  40  	NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
a117e66e KAMEZAWA Hiroyuki  2006-03-27  41  })
a117e66e KAMEZAWA Hiroyuki  2006-03-27  42  
67de6482 Andy Whitcroft     2006-06-23  43  #define __page_to_pfn(pg)						\
aa462abe Ian Campbell       2011-08-17  44  ({	const struct page *__pg = (pg);					\
a0140c1d KAMEZAWA Hiroyuki  2006-03-27  45  	struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg));	\
a0140c1d KAMEZAWA Hiroyuki  2006-03-27  46  	(unsigned long)(__pg - __pgdat->node_mem_map) +			\
a0140c1d KAMEZAWA Hiroyuki  2006-03-27  47  	 __pgdat->node_start_pfn;					\
a117e66e KAMEZAWA Hiroyuki  2006-03-27  48  })
a117e66e KAMEZAWA Hiroyuki  2006-03-27  49  
8f6aac41 Christoph Lameter  2007-10-16  50  #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
8f6aac41 Christoph Lameter  2007-10-16  51  
af901ca1 André Goddard Rosa 2009-11-14  52  /* memmap is virtually contiguous.  */
8f6aac41 Christoph Lameter  2007-10-16  53  #define __pfn_to_page(pfn)	(vmemmap + (pfn))
32272a26 Martin Schwidefsky 2008-12-25 @54  #define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)
8f6aac41 Christoph Lameter  2007-10-16  55  
a117e66e KAMEZAWA Hiroyuki  2006-03-27  56  #elif defined(CONFIG_SPARSEMEM)
a117e66e KAMEZAWA Hiroyuki  2006-03-27  57  /*
1a49123b Zhang Yanfei       2013-10-03  58   * Note: section's mem_map is encoded to reflect its start_pfn.
a117e66e KAMEZAWA Hiroyuki  2006-03-27  59   * section[i].section_mem_map == mem_map's address - start_pfn;
a117e66e KAMEZAWA Hiroyuki  2006-03-27  60   */
67de6482 Andy Whitcroft     2006-06-23  61  #define __page_to_pfn(pg)					\
aa462abe Ian Campbell       2011-08-17  62  ({	const struct page *__pg = (pg);				\

:::::: The code at line 54 was first introduced by commit
:::::: 32272a26974d2027384fd4010cd1780fca425d94 [S390] __page_to_pfn warnings

:::::: TO: Martin Schwidefsky <schwidefsky@de.ibm.com>
:::::: CC: Martin Schwidefsky <schwidefsky@de.ibm.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 18602 bytes --]

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Reindl Harald @ 2017-02-17 10:03 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <033aa14d-32a0-531d-7d7d-c01a0aeb13eb@youngman.org.uk>



Am 17.02.2017 um 02:24 schrieb Anthony Youngman:
> On 16/02/17 14:08, Reindl Harald wrote:
>> Hi
>>
>> i am new and was redirected to this list from the bugtracker
>> please have a look at https://bugzilla.kernel.org/show_bug.cgi?id=194551
>>
>> currently "writemostly" seems to be only supported on "real" RAID1 while
>> i was in hope that by the conecpt of RAID10 having more or less
>> RAID0+RAID1 it would also work on RAID10 (and on the virtual machine wre
>> i tested mdadm with the flag before buying the disks it did not complain)
>
> Be careful. Don't confuse Raid10 with Raid1+0. They are NOT the same
> thing (on linux at least), although they are very similar

yeah, i realized that but anyways thought the "writemostly" logic is 
there too and maybe the docs not up-to-date

sadly i can't write a patch on my own but only point how useful it would be

let's say you need a fast and really large storage on a mostly-read 
workload - take 10x4 TB disks - RAID5/RAID6 is horrible in case of drive 
error and rebuild, 10 x 4 TB SSD is horrible in case of pricing

5x4 TB SSD = 5 x 1400 = 7000
5x4 TB HDD = 5 x 100 = 500
total price 7500 versus 14000 for flash-only

surely, you can setup 5 RAID1 and on top RAID0 or LVM for such a large 
setup if your start from scratch - on the other hand my current RAID10 
is from 2011 where SSD was not such a topic and since the operating 
system is RAID10 too the inital setup is not that easy and after 
Fedora/RHEL "reworked" anaconda it's more painful to impossible (i even 
had enough of the manual partition tool in a virtual machine installing 
CentOS7 and created the data partitions after the OS install)

^ permalink raw reply

* Re: mdadm: how to move superblock 1.0 on reduced components
From: Wols Lists @ 2017-02-17 10:36 UTC (permalink / raw)
  To: G.raud, linux-raid
In-Reply-To: <20170215234114.GL28234@gmx.com>

On 15/02/17 23:41, G.raud wrote:
> I cannot find information about this topic on the web.

I'll add a little more for you to think about ... :-) Have you looked at
the raid wiki? In particular, have you read this section?

https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Array_internals_and_how_it_affects_mdadm
> 
> Reducing the size of the components of an md array keeps the superblock
> 1.0 at the same position (which avoids data corruption and maybe is
> required for md to find it).  However how can one then reduce the size
> of the underlying device without losing the superblock?
> 
> In an array with redundancy re-addind reduced components one after the
> other is a rather safe solution, I suppose, but it is time consuming.
> Would it be possible to update the superblock version by doing that,
> either by mixing different superblock versions by giving --metadata with
> --add or by modifying the partitions defining the re-added components so
> that the new devices start at the right offset for a superblock 1.2 (and
> later using the following solution, re-creating the array with the
> original partitions and a 1.2 superblock at the right offset with
> --data-offset)?

I don't believe you can mix different meta-data versions. Almost
certainly you can't mix 0.9 and 1, and I doubt you can mix the different
v1 versions.
> 
> Re-creating all components after a reduction of the underlying devices
> seems possible by using --assume-clean.  What to look for to avoid data
> loss?
> 
> mdadm already supports enlarging a 1.0 component, which requires moving
> the superblock, so the code to reduce one would probably not require
> much efforts.  Could it be added in the future?

Are you sure it doesn't do it already? You will probably find (I can't
say for sure) that mdadm adjusts the size, calculates the new position,
and moves the superblock without actually being aware whether the space
is getting larger or smaller. I know elsewhere "grow" actually means
"make bigger or smaller" - I would be surprised if that's not true here.
> 
> I would like to attempt to move the superblocks 1.0 manually, but I lack
> some information:
> 
> 1) Should the reserved space before the superblock be moved along?
> 
> 2) Is there metadata inside a superblock 1.0 to update after the move
>    (that another program would not correct automatically)?
> 
> 3) How is the superblock detected?  Is there only one position possible
>    for a given chunk size on a given component size?  Should the
>    remaining (padding) sectors of the underlying device be zeroed?
> 
If you're going to move them manually (whatever that means) why not look
at fixing mdadm to do it for you? I believe all the v1 superblocks are
identical, differing only in their location, so if you look at how mdadm
changes the size of a v1.0 that should tell you most of what you need to
know.

I'd just check whether there is sufficient space at the start to move
the superblocks. If there isn't fail with "no space, use --data-offset
to create space before moving the superblock". If there is the space,
just do it.

If you're mad enough to use a hex editor, or dd or whatever to move
stuff around, please write it up :-) I can then add it to the wiki as a
pointer for anyone who wants to modify mdadm.

And while it may be a little extra work, if you're fixing mdadm to move
superblocks around, please make it able to convert any v1 to any other
v1, don't do just your specific requirement. The extra work is hopefully
minimal and would be much appreciated. I personally hate it when I come
across stuff that "sometimes works, sometimes doesn't" for no obvious
reason other than the author fixed his specific problem rather than a
generic solution. Okay, people shouldn't want to convert v1.2 to v1.1 or
v1.0, but you can bet your bottom dollar someone will come up with some
weird and justifiable reason to do so ...

Cheers,
Wol


^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-17 12:40 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <b54e3798-4cdc-e1d3-a11e-625290131d37@suse.de>

On 2017/2/17 上午1:05, Coly Li wrote:
> On 2017/2/16 上午10:22, Shaohua Li wrote:
>> On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@suse.de wrote:

[snip]
>>
>>>  
>>> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>> -				 struct r1bio *r1_bio)
>>> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>>>  {
>>>  	struct r1conf *conf = mddev->private;
>>>  	struct raid1_info *mirror;
>>> +	struct r1bio *r1_bio;
>>>  	struct bio *read_bio;
>>>  	struct bitmap *bitmap = mddev->bitmap;
>>>  	const int op = bio_op(bio);
>>> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>>  	int max_sectors;
>>>  	int rdisk;
>>>  
>>> -	wait_barrier(conf, bio);
>>> +	/*
>>> +	 * Still need barrier for READ in case that whole
>>> +	 * array is frozen.
>>> +	 */
>>> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
>>> +	bitmap = mddev->bitmap;
>>> +
>>> +	/*
>>> +	 * make_request() can abort the operation when read-ahead is being
>>> +	 * used and no empty request is available.
>>> +	 *
>>> +	 */
>>> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>>> +	r1_bio->master_bio = bio;
>>> +	r1_bio->sectors = bio_sectors(bio);
>>> +	r1_bio->state = 0;
>>> +	r1_bio->mddev = mddev;
>>> +	r1_bio->sector = bio->bi_iter.bi_sector;
>>
>> This part looks unnecessary complicated. If you change raid1_make_request to
>> something like __raid1_make_reques, add a new raid1_make_request and do bio
>> split there, then call __raid1_make_request for each splitted bio, then you
>> don't need to duplicate the r1_bio allocation parts for read/write.
>>
> 
> Aha, good point! I will do that.

I find adding one more wrap function increases stack depth in raid1 I/O
path. Finally I choose to use a static inline function to allocate
r1bio, other than wrap one more function.


+static inline struct r1bio *
+alloc_r1bio(struct mddev *mddev, struct bio *bio, sector_t sectors_handled)
+{
+       struct r1conf *conf = mddev->private;
+       struct r1bio *r1_bio = NULL;
+
+
+       r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+       if (!r1_bio) {
+               WARN(1, "%s: fail to allocate r1_bio.\n", __func__);
+               return NULL;
+       }
+
+       r1_bio->master_bio = bio;
+       r1_bio->sectors = bio_sectors(bio) - sectors_handled;
+       r1_bio->state = 0;
+       r1_bio->mddev = mddev;
+       r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+
+       return r1_bio;
+}

A good point is, before jumping to read_again in raid1_read_request() or
jumping to retry_write in raid1_write_request(), this function can help
to remove more redundant code like this,
	r1_bio = alloc_r1bio(mddev, bio, sectors_handled);

I don't check whether alloc_r1bio() returns NULL in neither
raid1_read_request() nor raid1_write_request(). If the allocation
failed, it may cause a NULL pointer deference fault, but I don't plan to
fix it in this patch set, just keep the existing code logic.

Coly



^ permalink raw reply

* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2017-02-17 18:35 UTC (permalink / raw)
  To: NeilBrown, linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <58e7810c-16db-1c45-c981-63c36eb0b1c8@suse.de>

On 2017/2/17 下午3:56, Coly Li wrote:
> On 2017/2/16 下午3:04, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@suse.de wrote:
>>
>>> @@ -2393,6 +2455,11 @@ static void handle_write_finished(struct
>>> r1conf *conf, struct r1bio *r1_bio) idx =
>>> sector_to_idx(r1_bio->sector); conf->nr_queued[idx]++; 
>>> spin_unlock_irq(&conf->device_lock); +		/* +		 * In case
>>> freeze_array() is waiting for condition +		 *
>>> get_unqueued_pending() == extra to be true. +		 */ +
>>> wake_up(&conf->wait_barrier); 
>>> md_wakeup_thread(conf->mddev->thread); } else { if
>>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2529,9 +2596,7
>>> @@ static void raid1d(struct md_thread *thread) retry_list); 
>>> list_del(&r1_bio->retry_list); idx =
>>> sector_to_idx(r1_bio->sector); -
>>> spin_lock_irqsave(&conf->device_lock, flags); 
>>> conf->nr_queued[idx]--; -
>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> Why do you think it is safe to decrement nr_queued without holding
>> the lock? Surely this could race with handle_write_finished, and an
>> update could be lost.
> 
> conf->nr_queued[idx] is an integer and aligned to 4 bytes address, so
> conf->nr_queued[idx]++ is same to atomic_inc(&conf->nr_queued[idx]),
> it is atomic operation. And there is no ordering requirement, so I
> don't need memory barrier here. This is why I remove spin lock, and
> change it from atomic_t back to int.
> 
> 
> IMHO, the problematic location is not here, but in freeze_array(). Now
> the code assume array is froze when "get_unqueued_pending(conf) ==
> extra" gets true. I think it is incorrect.

Hmm, I am wrong here. conf->nr_queued[idx]++ is not atomic. Yes, it
should be atomic_t, I will fix it in next version.

Coly Li


^ permalink raw reply

* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Coly Li @ 2017-02-17 18:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <20170216022526.47zoupqvieetyliw@kernel.org>

On 2017/2/16 上午10:25, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 12:35:23AM +0800, colyli@suse.de wrote:
>> When I run a parallel reading performan testing on a md raid1 device with
>> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
>> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
>> only 2.7GB/s, this is around 50% of the idea performance number.
>>
>> The perf reports locking contention happens at allow_barrier() and
>> wait_barrier() code,
>>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>>    - _raw_spin_lock_irqsave
>>          + 89.92% allow_barrier
>>          + 9.34% __wake_up
>>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>>    - _raw_spin_lock_irq
>>          - 100.00% wait_barrier
>>
>> The reason is, in these I/O barrier related functions,
>>  - raise_barrier()
>>  - lower_barrier()
>>  - wait_barrier()
>>  - allow_barrier()
>> They always hold conf->resync_lock firstly, even there are only regular
>> reading I/Os and no resync I/O at all. This is a huge performance penalty.
>>
>> The solution is a lockless-like algorithm in I/O barrier code, and only
>> holding conf->resync_lock when it has to.
>>
>> The original idea is from Hannes Reinecke, and Neil Brown provides
>> comments to improve it. I continue to work on it, and make the patch into
>> current form.
>>
>> In the new simpler raid1 I/O barrier implementation, there are two
>> wait barrier functions,
>>  - wait_barrier()
>>    Which calls _wait_barrier(), is used for regular write I/O. If there is
>>    resync I/O happening on the same I/O barrier bucket, or the whole
>>    array is frozen, task will wait until no barrier on same barrier bucket,
>>    or the whold array is unfreezed.
>>  - wait_read_barrier()
>>    Since regular read I/O won't interfere with resync I/O (read_balance()
>>    will make sure only uptodate data will be read out), it is unnecessary
>>    to wait for barrier in regular read I/Os, waiting in only necessary
>>    when the whole array is frozen.
>>
>> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
>> barrier[idx] are very carefully designed in raise_barrier(),
>> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
>> avoid unnecessary spin locks in these functions. Once conf->
>> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
>> has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
>> raised in same barrier bucket index and array is not frozen, the regular
>> I/O doesn't need to hold conf->resync_lock, it can just increase
>> conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
>> very similar to _wait_barrier(), the only difference is it only waits when
>> array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
>> code almostly gets rid of all spin lock cost.
>>
>> This patch significantly improves raid1 reading peroformance. From my
>> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
>> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
>> increases from 2.7GB/s to 4.6GB/s (+70%).
>>
>> Changelog
>> V3:
>> - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
>> - Change conf->nr_queued[] from atomic_t to int.
> 
> I missed this part. In the code, the nr_queued sometimes is protected by
> device_lock, sometimes (raid1d) no protection at all. Can you explain this?

I make a mistake here, integer set is atomic, but integer plus is not.
conf->nr_queued[] must be atomic_t here, otherwise it is racy. I will
fix it in V4 patch.

Thanks for the review.

Coly


^ permalink raw reply

* Process stuck in md_flush_request (state: D)
From: Les Stroud @ 2017-02-17 19:05 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <36A8825E-F387-4ED8-8672-976094B3BEBB@lesstroud.com>


I have a problem with processes entering an uninterruptible sleep state in md_flush_request and never returning. I having trouble identifying the underlying issue. I’m hoping someone on here may be able to help.

The servers in question are running in aws (xen hvm) with kernel 3.8.13-118.16.2.el6uek.x86_64.  The server has two mounts.  The first is vanilla ext4.  The second is a software RAID0 array, striped with 256k chunks, buiIt with md.  It’s file system is ext4. 

The most immediately and obvious symptom of the issue are kernel errors “kernel: INFO task [some process]: blocked for more than 120 seconds.”.  Shortly there after, other processes start entering the same uninterruptible wait state (D). This almost always impacts ssh logins.

The problem does not occur when the system is under load, or was recently under load.  It happens when the system is quiet (no cpu, very little io).

I’ve been able to gather a little information from the logs and process tree (when I could get in). It seems to start with an md process (jdb2/md-127-8). Most of the time it’s queuing up kworker processes behind it.  However, I’ve seen other processes stuck as well.  When I can find a way to execute commands on the box, some of them will work, some of them will hand immediately, and others will work a few times before hanging themselves. Every time I have been able to look, I’ve found these “hung” processes sitting in one of two methods md_flush_cache and md_flush_request.  


I would appreciate any help you can provide to diagnose and resolve this issue, find an available patch, or report a bug (if necessary).



Here is a piece of one of the call traces:

	Feb  6 15:41:57 server1 kernel: INFO: task jbd2/md127-8:2511 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: jbd2/md127-8    D ffffffff815c6d00     0  2511      2 0x00000000
	Feb  6 15:41:57 server1 kernel: ffff8803f8a659c8 0000000000000046 ffff8803f8a65fd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f8a64010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f8a65fd8 0000000000013040 ffff8803f95b8300 ffff8803f5c3e340
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b886>] md_flush_request+0x86/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086da0>] ? wake_up_bit+0x40/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffffa00ab6eb>] raid0_make_request+0x11b/0x200 [raid0]
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126c83d>] ? generic_make_request_checks+0x1ad/0x400
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b60c>] md_make_request+0xdc/0x240
	Feb  6 15:41:57 server1 kernel: [<ffffffff81135365>] ? mempool_alloc_slab+0x15/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff811354d0>] ? mempool_alloc+0x60/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cb5a>] generic_make_request+0xca/0x100
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cc09>] submit_bio+0x79/0x160
	Feb  6 15:41:57 server1 kernel: [<ffffffff811ce0e5>] ? bio_alloc_bioset+0x65/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c8e50>] _submit_bh+0x130/0x200
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c8f30>] submit_bh+0x10/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffffa0167fbd>] journal_submit_commit_record+0x14d/0x1d0 [jbd2]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa016e380>] ? journal_free_journal_head+0x60/0x70 [jbd2]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa0168eab>] jbd2_journal_commit_transaction+0xcfb/0x1950 [jbd2]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa016f213>] kjournald2+0xf3/0x3e0 [jbd2]
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086da0>] ? wake_up_bit+0x40/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffffa016f120>] ? commit_timeout+0x10/0x10 [jbd2]
	Feb  6 15:41:57 server1 kernel: [<ffffffff810864ee>] kthread+0xce/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aa9c8>] ret_from_fork+0x58/0x90
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: INFO: task ora_ckpt_rdbdev:1200 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: ora_ckpt_rdbdev D ffffffff815c6d00     0  1200      1 0x00000080
	Feb  6 15:41:57 server1 kernel: ffff8803f5327898 0000000000000082 ffff8803f5327fd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f5326010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f5327fd8 0000000000013040 ffffffff818c3420 ffff8803f7ad4140
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a119e>] schedule_preempt_disabled+0xe/0x10
	Feb  6 15:41:57 server1 kernel: [<ffffffff8159f767>] __mutex_lock_slowpath+0x177/0x220
	Feb  6 15:41:57 server1 kernel: [<ffffffff81138e51>] ? prep_new_page+0x111/0x180
	Feb  6 15:41:57 server1 kernel: [<ffffffff8159f5cb>] mutex_lock+0x2b/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d400e>] do_blockdev_direct_IO+0x7be/0x860
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d40fc>] __blockdev_direct_IO+0x4c/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffffa0191ab0>] ? ext4_get_block_write+0x20/0x20 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa01d0c78>] ext4_ind_direct_IO+0xf8/0x480 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa0191ab0>] ? ext4_get_block_write+0x20/0x20 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa019331b>] ext4_ext_direct_IO+0x16b/0x240 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffffa01934c9>] ext4_direct_IO+0xd9/0x180 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffff811349e2>] generic_file_read_iter+0x122/0x130
	Feb  6 15:41:57 server1 kernel: [<ffffffff8119615c>] do_aio_read+0xbc/0xd0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81196217>] do_sync_read+0xa7/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81196521>] vfs_read+0xb1/0x130
	Feb  6 15:41:57 server1 kernel: [<ffffffff81196eaa>] sys_pread64+0x9a/0xb0
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aaa79>] system_call_fastpath+0x16/0x1b
	Feb  6 15:41:57 server1 kernel: INFO: task kworker/0:5:30438 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: kworker/0:5     D 0000000000000000     0 30438      2 0x00000080
	Feb  6 15:41:57 server1 kernel: ffff8803f7df5ad8 0000000000000046 ffff8803f7df5fd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f7df4010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f7df5fd8 0000000000013040 ffff8802958be500 ffff8802acd20200
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b886>] md_flush_request+0x86/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086da0>] ? wake_up_bit+0x40/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffff8159eed5>] ? schedule_timeout+0x195/0x220
	Feb  6 15:41:57 server1 kernel: [<ffffffffa00ab6eb>] raid0_make_request+0x11b/0x200 [raid0]
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126c83d>] ? generic_make_request_checks+0x1ad/0x400
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b60c>] md_make_request+0xdc/0x240
	Feb  6 15:41:57 server1 kernel: [<ffffffff81135365>] ? mempool_alloc_slab+0x15/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff811354d0>] ? mempool_alloc+0x60/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cb5a>] generic_make_request+0xca/0x100
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cc09>] submit_bio+0x79/0x160
	Feb  6 15:41:57 server1 kernel: [<ffffffff811ce0e5>] ? bio_alloc_bioset+0x65/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126fe68>] blkdev_issue_flush+0xb8/0xf0
	Feb  6 15:41:57 server1 kernel: [<ffffffffa018c807>] ext4_sync_file+0x167/0x300 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c642e>] vfs_fsync_range+0x2e/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c6491>] generic_write_sync+0x41/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1d8b>] dio_complete+0x11b/0x140
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ce0c>] ? need_to_create_worker+0x1c/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1f04>] dio_aio_complete_work+0x24/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ea50>] process_one_work+0x180/0x420
	Feb  6 15:41:57 server1 kernel: [<ffffffff8108119e>] worker_thread+0x12e/0x390
	Feb  6 15:41:57 server1 kernel: [<ffffffff81081070>] ? manage_workers+0x180/0x180
	Feb  6 15:41:57 server1 kernel: [<ffffffff810864ee>] kthread+0xce/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aa9c8>] ret_from_fork+0x58/0x90
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: INFO: task kworker/1:2:31430 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: kworker/1:2     D 0000000000000001     0 31430      2 0x00000080
	Feb  6 15:41:57 server1 kernel: ffff8802a7dcfb58 0000000000000046 ffff8802a7dcffd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8802a7dce010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8802a7dcffd8 0000000000013040 ffff8803f6dca3c0 ffff8802a25bc580
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff8159eed5>] schedule_timeout+0x195/0x220
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126c83d>] ? generic_make_request_checks+0x1ad/0x400
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b60c>] ? md_make_request+0xdc/0x240
	Feb  6 15:41:57 server1 kernel: [<ffffffff81135365>] ? mempool_alloc_slab+0x15/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff811354d0>] ? mempool_alloc+0x60/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0cda>] wait_for_common+0x11a/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cb5a>] ? generic_make_request+0xca/0x100
	Feb  6 15:41:57 server1 kernel: [<ffffffff8109a3d0>] ? try_to_wake_up+0x2e0/0x2e0
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0e0d>] wait_for_completion+0x1d/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126fe70>] blkdev_issue_flush+0xc0/0xf0
	Feb  6 15:41:57 server1 kernel: [<ffffffffa018c8ef>] ext4_sync_file+0x24f/0x300 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c642e>] vfs_fsync_range+0x2e/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c6491>] generic_write_sync+0x41/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1d8b>] dio_complete+0x11b/0x140
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ce0c>] ? need_to_create_worker+0x1c/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1f04>] dio_aio_complete_work+0x24/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ea50>] process_one_work+0x180/0x420
	Feb  6 15:41:57 server1 kernel: [<ffffffff8108119e>] worker_thread+0x12e/0x390
	Feb  6 15:41:57 server1 kernel: [<ffffffff81081070>] ? manage_workers+0x180/0x180
	Feb  6 15:41:57 server1 kernel: [<ffffffff810864ee>] kthread+0xce/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aa9c8>] ret_from_fork+0x58/0x90
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: INFO: task kworker/3:0:32582 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: kworker/3:0     D 0000000000000003     0 32582      2 0x00000080
	Feb  6 15:41:57 server1 kernel: ffff8803f6e97ad8 0000000000000046 ffff8803f6e97fd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f6e96010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff8803f6e97fd8 0000000000013040 ffff8802b7806440 ffff8802a1f92540
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b886>] md_flush_request+0x86/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086da0>] ? wake_up_bit+0x40/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffffa00ab6eb>] raid0_make_request+0x11b/0x200 [raid0]
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126c83d>] ? generic_make_request_checks+0x1ad/0x400
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b60c>] md_make_request+0xdc/0x240
	Feb  6 15:41:57 server1 kernel: [<ffffffff81135365>] ? mempool_alloc_slab+0x15/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff811354d0>] ? mempool_alloc+0x60/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cb5a>] generic_make_request+0xca/0x100
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cc09>] submit_bio+0x79/0x160
	Feb  6 15:41:57 server1 kernel: [<ffffffff811ce0e5>] ? bio_alloc_bioset+0x65/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126fe68>] blkdev_issue_flush+0xb8/0xf0
	Feb  6 15:41:57 server1 kernel: [<ffffffffa018c807>] ext4_sync_file+0x167/0x300 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c642e>] vfs_fsync_range+0x2e/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c6491>] generic_write_sync+0x41/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1d8b>] dio_complete+0x11b/0x140
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1f04>] dio_aio_complete_work+0x24/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ea50>] process_one_work+0x180/0x420
	Feb  6 15:41:57 server1 kernel: [<ffffffff8108119e>] worker_thread+0x12e/0x390
	Feb  6 15:41:57 server1 kernel: [<ffffffff81081070>] ? manage_workers+0x180/0x180
	Feb  6 15:41:57 server1 kernel: [<ffffffff810864ee>] kthread+0xce/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aa9c8>] ret_from_fork+0x58/0x90
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: INFO: task kworker/1:1:416 blocked for more than 120 seconds.
	Feb  6 15:41:57 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	Feb  6 15:41:57 server1 kernel: kworker/1:1     D ffffffff815c6d00     0   416      2 0x00000080
	Feb  6 15:41:57 server1 kernel: ffff88029ff75ad8 0000000000000046 ffff88029ff75fd8 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff88029ff74010 0000000000013040 0000000000013040 0000000000013040
	Feb  6 15:41:57 server1 kernel: ffff88029ff75fd8 0000000000013040 ffff8803f95b8300 ffff8802b1d983c0
	Feb  6 15:41:57 server1 kernel: Call Trace:
	Feb  6 15:41:57 server1 kernel: [<ffffffff815a0ea9>] schedule+0x29/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b886>] md_flush_request+0x86/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086da0>] ? wake_up_bit+0x40/0x40
	Feb  6 15:41:57 server1 kernel: [<ffffffffa00ab6eb>] raid0_make_request+0x11b/0x200 [raid0]
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126c83d>] ? generic_make_request_checks+0x1ad/0x400
	Feb  6 15:41:57 server1 kernel: [<ffffffff8146b60c>] md_make_request+0xdc/0x240
	Feb  6 15:41:57 server1 kernel: [<ffffffff81135365>] ? mempool_alloc_slab+0x15/0x20
	Feb  6 15:41:57 server1 kernel: [<ffffffff811354d0>] ? mempool_alloc+0x60/0x170
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cb5a>] generic_make_request+0xca/0x100
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126cc09>] submit_bio+0x79/0x160
	Feb  6 15:41:57 server1 kernel: [<ffffffff811ce0e5>] ? bio_alloc_bioset+0x65/0x120
	Feb  6 15:41:57 server1 kernel: [<ffffffff8126fe68>] blkdev_issue_flush+0xb8/0xf0
	Feb  6 15:41:57 server1 kernel: [<ffffffffa018c807>] ext4_sync_file+0x167/0x300 [ext4]
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c642e>] vfs_fsync_range+0x2e/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff811c6491>] generic_write_sync+0x41/0x50
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1d8b>] dio_complete+0x11b/0x140
	Feb  6 15:41:57 server1 kernel: [<ffffffff811d1f04>] dio_aio_complete_work+0x24/0x30
	Feb  6 15:41:57 server1 kernel: [<ffffffff8107ea50>] process_one_work+0x180/0x420
	Feb  6 15:41:57 server1 kernel: [<ffffffff8108119e>] worker_thread+0x12e/0x390
	Feb  6 15:41:57 server1 kernel: [<ffffffff81081070>] ? manage_workers+0x180/0x180
	Feb  6 15:41:57 server1 kernel: [<ffffffff810864ee>] kthread+0xce/0xe0
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70
	Feb  6 15:41:57 server1 kernel: [<ffffffff815aa9c8>] ret_from_fork+0x58/0x90
	Feb  6 15:41:57 server1 kernel: [<ffffffff81086420>] ? kthread_freezable_should_stop+0x70/0x70



Thank you for the time,
LES





^ permalink raw reply

* [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: colyli @ 2017-02-17 19:05 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang

'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
introduces a sliding resync window for raid1 I/O barrier, this idea limits
I/O barriers to happen only inside a slidingresync window, for regular
I/Os out of this resync window they don't need to wait for barrier any
more. On large raid1 device, it helps a lot to improve parallel writing
I/O throughput when there are background resync I/Os performing at
same time.

The idea of sliding resync widow is awesome, but code complexity is a
challenge. Sliding resync window requires several variables to work
collectively, this is complexed and very hard to make it work correctly.
Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
to fix the original resync window patch. This is not the end, any further
related modification may easily introduce more regreassion.

Therefore I decide to implement a much simpler raid1 I/O barrier, by
removing resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window
 - Use multiple hash buckets to reduce I/O barrier conflicts, regular
   I/O only has to wait for a resync I/O when both them have same barrier
   bucket index, vice versa.
 - I/O barrier can be reduced to an acceptable number if there are enough
   barrier buckets

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
   Bio requests won't go across border of barrier unit size, that means
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
   For random I/O 64MB is large enough for both read and write requests,
   for sequential I/O considering underlying block layer may merge them
   into larger requests, 64MB is still good enough.
   Neil also points out that for resync operation, "we want the resync to
   move from region to region fairly quickly so that the slowness caused
   by having to synchronize with the resync is averaged out over a fairly
   small time frame". For full speed resync, 64MB should take less then 1
   second. When resync is competing with other I/O, it could take up a few
   minutes. Therefore 64MB size is fairly good range for resync.

 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   this patch makes the bellowed members of struct r1conf from integer
   to array of integers,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     *nr_pending;
        +       int                     *nr_waiting;
        +       int                     *nr_queued;
        +       int                     *barrier;
   number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
   kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
   barrier buckets, and each array of integers occupies single memory page.
   1024 means for a request which is smaller than the I/O barrier unit size
   has ~0.1% chance to wait for resync to pause, which is quite a small
   enough fraction. Also requesting single memory page is more friendly to
   kernel page allocator than larger memory size.

 - I/O barrier bucket is indexed by bio start sector
   If multiple I/O requests hit different I/O barrier units, they only need
   to compete I/O barrier with other I/Os which hit the same I/O barrier
   bucket index with each other. The index of a barrier bucket which a
   bio should look for is calculated by sector_to_idx() which is defined
   in raid1.h as an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                BARRIER_BUCKETS_NR_BITS);
        }
   Here sector_nr is the start sector number of a bio.

 - Single bio won't go across boundary of a I/O barrier unit
   If a request goes across boundary of barrier unit, it will be split. A
   bio may be split in raid1_make_request() or raid1_sync_request(), if
   sectors returned by align_to_barrier_unit_end() is smaller than
   original bio size.

Comparing to single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier units. So the I/O behavior is
   similar to single sliding resync window.
 - But a barrier unit bucket is shared by all barrier units with identical
   barrier uinit index, the probability of conflict might be higher
   than single sliding resync window, in condition that writing I/Os
   always hit barrier units which have identical barrier bucket indexs with
   the resync I/Os. This is a very rare condition in real I/O work loads,
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low conflict rate with much
   simpler barrier algorithm and implementation.

There are two changes should be noticed,
 - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
   single loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in next patch of
   this patch set, it will be replaced by a single line code,
        atomic_dec(&conf->nr_queueud[idx]);
   So we don't need to worry about spin lock cost here.
 - Mainline raid1 code split original raid1_make_request() into
   raid1_read_request() and raid1_write_request(). If the original bio
   goes across an I/O barrier unit size, this bio will be split before
   calling raid1_read_request() or raid1_write_request(),  this change
   the code logic more simple and clear.
 - In this patch wait_barrier() is moved from raid1_make_request() to
   raid1_write_request(). In raid_read_request(), original wait_barrier()
   is replaced by raid1_read_request().
   The differnece is wait_read_barrier() only waits if array is frozen,
   using different barrier function in different code path makes the code
   more clean and easy to read.
Changelog
V4:
- Add alloc_r1bio() to remove redundant r1bio memory allocation code.
- Fix many typos in patch comments.
- Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
V3:
- Rebase the patch against latest upstream kernel code.
- Many fixes by review comments from Neil,
  - Back to use pointers to replace arraries in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of
    BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and
    get_all_queued()
  - Increase bucket number from 512 to 1024
- Change code comments format by review from Shaohua.
V2:
- Use bio_split() to split the orignal bio if it goes across barrier unit
  bounday, to make the code more simple, by suggestion from Shaohua and
  Neil.
- Use hash_long() to replace original linear hash, to avoid a possible
  confilict between resync I/O and sequential write I/O, by suggestion from
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control number of parallel sync I/O barriers, by suggestion from Shaohua.
- In V1 patch the bellowed barrier buckets related members in r1conf are
  allocated in memory page. To make the code more simple, V2 patch moves
  the memory space into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is by the suggestion from Shaohua.
- Remove some inrelavent code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 469 ++++++++++++++++++++++++++++++-----------------------
 drivers/md/raid1.h |  57 ++++---
 2 files changed, 299 insertions(+), 227 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..621b3ae 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -71,9 +71,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 #define raid1_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
@@ -100,7 +99,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -215,7 +213,7 @@ static void put_buf(struct r1bio *r1_bio)
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, r1_bio->sector);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -223,10 +221,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	int idx;
 
+	idx = sector_to_idx(r1_bio->sector);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -243,7 +243,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
@@ -269,7 +268,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		 * Wake up any possible resync thread that waits for the device
 		 * to go idle.
 		 */
-		allow_barrier(conf, start_next_window, bi_sector);
+		allow_barrier(conf, bi_sector);
 	}
 }
 
@@ -517,6 +516,25 @@ static void raid1_end_write_request(struct bio *bio)
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/*
+	 * len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -813,168 +831,168 @@ static void flush_pending_writes(struct r1conf *conf)
  */
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	int idx = sector_to_idx(sector_nr);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in corresponding I/O barrier bucket.
+	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
+	 *    max resync count which allowed on current I/O barrier bucket.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			     !conf->nr_pending[idx] &&
+			     conf->barrier[idx] < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	int idx = sector_to_idx(sector_nr);
+
+	BUG_ON(conf->barrier[idx] <= 0);
+
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	int idx = sector_to_idx(sector_nr);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		raid1_log(conf->mddev, "wait barrier");
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, int idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+/* conf->resync_lock should be held */
+static int get_unqueued_pending(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
-	/* stop syncio and normal IO and wait for everything to
+	/* Stop sync I/O and normal I/O and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
-	 * This is called in the context of one normal IO request
-	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
-	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * This is called in two situations:
+	 * 1) management command handlers (reshape, remove disk, quiesce).
+	 * 2) one normal I/O request failed.
+
+	 * After array_frozen is set to 1, new sync IO will be blocked at
+	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier()
+	 * or wait_read_barrier(). The flying I/Os will either complete or be
+	 * queued. When everything goes quite, there are only queued I/Os left.
+
+	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
+	 * barrier bucket index which this I/O request hits. When all sync and
+	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
+	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
+	 * in handle_read_error(), we may call freeze_array() before trying to
+	 * fix the read error. In this case, the error read I/O is not queued,
+	 * so get_unqueued_pending() == 1.
+	 *
+	 * Therefore before this function returns, we need to wait until
+	 * get_unqueued_pendings(conf) gets equal to extra. For
+	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
 	raid1_log(conf->mddev, "wait freeze");
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_unqueued_pending(conf) == extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1070,11 +1088,33 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_read_request(struct mddev *mddev, struct bio *bio,
-				 struct r1bio *r1_bio)
+static inline struct r1bio *
+alloc_r1bio(struct mddev *mddev, struct bio *bio, sector_t sectors_handled)
+{
+	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio = NULL;
+
+
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	if (!r1_bio) {
+		WARN(1, "%s: fail to allocate r1_bio.\n", __func__);
+		return NULL;
+	}
+
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio) - sectors_handled;
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+
+	return r1_bio;
+}
+
+static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
+	struct r1bio *r1_bio;
 	struct bio *read_bio;
 	struct bitmap *bitmap = mddev->bitmap;
 	const int op = bio_op(bio);
@@ -1083,8 +1123,30 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	int max_sectors;
 	int rdisk;
 
-	wait_barrier(conf, bio);
+	/*
+	 * Still need barrier for READ in case that whole
+	 * array is frozen.
+	 */
+	wait_read_barrier(conf, bio->bi_iter.bi_sector);
+	bitmap = mddev->bitmap;
+
+	r1_bio = alloc_r1bio(mddev, bio, 0);
+
+	/*
+	 * We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 */
 read_again:
 	rdisk = read_balance(conf, r1_bio, &max_sectors);
 
@@ -1106,7 +1168,6 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 			   atomic_read(&bitmap->behind_writes) == 0);
 	}
 	r1_bio->read_disk = rdisk;
-	r1_bio->start_next_window = 0;
 
 	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1151,22 +1212,16 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 		 */
 		reschedule_retry(r1_bio);
 
-		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
-
-		r1_bio->master_bio = bio;
-		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
-		r1_bio->state = 0;
-		r1_bio->mddev = mddev;
-		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		r1_bio = alloc_r1bio(mddev, bio, sectors_handled);
 		goto read_again;
 	} else
 		generic_make_request(read_bio);
 }
 
-static void raid1_write_request(struct mddev *mddev, struct bio *bio,
-				struct r1bio *r1_bio)
+static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
 	int i, disks;
 	struct bitmap *bitmap = mddev->bitmap;
 	unsigned long flags;
@@ -1180,7 +1235,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1216,7 +1270,19 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
-	start_next_window = wait_barrier(conf, bio);
+	wait_barrier(conf, bio->bi_iter.bi_sector);
+
+	r1_bio = alloc_r1bio(mddev, bio, 0);
+
+	/* We might need to issue multiple writes to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of writes in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
@@ -1237,7 +1303,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1304,25 +1369,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+		wait_barrier(conf, bio->bi_iter.bi_sector);
 		goto retry_write;
 	}
 
@@ -1430,12 +1485,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		/* We need another r1_bio.  It has already been counted
 		 * in bio->bi_phys_segments
 		 */
-		r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
-		r1_bio->master_bio = bio;
-		r1_bio->sectors = bio_sectors(bio) - sectors_handled;
-		r1_bio->state = 0;
-		r1_bio->mddev = mddev;
-		r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+		r1_bio = alloc_r1bio(mddev, bio, sectors_handled);
 		goto retry_write;
 	}
 
@@ -1447,36 +1497,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 {
-	struct r1conf *conf = mddev->private;
-	struct r1bio *r1_bio;
+	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
+	struct bio *split;
+	sector_t sectors;
 
-	/*
-	 * make_request() can abort the operation when read-ahead is being
-	 * used and no empty request is available.
-	 *
-	 */
-	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	make_request_fn = (bio_data_dir(bio) == READ) ?
+			  raid1_read_request : raid1_write_request;
 
-	r1_bio->master_bio = bio;
-	r1_bio->sectors = bio_sectors(bio);
-	r1_bio->state = 0;
-	r1_bio->mddev = mddev;
-	r1_bio->sector = bio->bi_iter.bi_sector;
-
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
-	 * no locking will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
+	/* if bio exceeds barrier unit boundary, split it */
+	do {
+		sectors = align_to_barrier_unit_end(
+				bio->bi_iter.bi_sector, bio_sectors(bio));
+		if (sectors < bio_sectors(bio)) {
+			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, bio);
+		} else {
+			split = bio;
+		}
 
-	if (bio_data_dir(bio) == READ)
-		raid1_read_request(mddev, bio, r1_bio);
-	else
-		raid1_write_request(mddev, bio, r1_bio);
+		make_request_fn(mddev, split);
+	} while (split != bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
@@ -1567,19 +1607,11 @@ static void print_conf(struct r1conf *conf)
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -2326,8 +2358,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 
 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int m;
+	int m, idx;
 	bool fail = false;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2353,7 +2386,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2475,6 +2509,7 @@ static void raid1d(struct md_thread *thread)
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	int idx;
 
 	md_check_recovery(mddev);
 
@@ -2482,17 +2517,17 @@ static void raid1d(struct md_thread *thread)
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			idx = sector_to_idx(r1_bio->sector);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2513,7 +2548,8 @@ static void raid1d(struct md_thread *thread)
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2552,7 +2588,6 @@ static int init_resync(struct r1conf *conf)
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2581,6 +2616,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	int idx = sector_to_idx(sector_nr);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2630,7 +2666,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2657,6 +2693,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit boundary */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2887,6 +2925,26 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	if (!conf)
 		goto abort;
 
+	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_pending)
+		goto abort;
+
+	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_waiting)
+		goto abort;
+
+	conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR,
+				  sizeof(int), GFP_KERNEL);
+	if (!conf->nr_queued)
+		goto abort;
+
+	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
+				sizeof(int), GFP_KERNEL);
+	if (!conf->barrier)
+		goto abort;
+
 	conf->mirrors = kzalloc(sizeof(struct raid1_info)
 				* mddev->raid_disks * 2,
 				 GFP_KERNEL);
@@ -2942,9 +3000,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
@@ -2987,6 +3042,10 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
+		kfree(conf->nr_pending);
+		kfree(conf->nr_waiting);
+		kfree(conf->nr_queued);
+		kfree(conf->barrier);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -3088,6 +3147,10 @@ static void raid1_free(struct mddev *mddev, void *priv)
 	kfree(conf->mirrors);
 	safe_put_page(conf->tmppage);
 	kfree(conf->poolinfo);
+	kfree(conf->nr_pending);
+	kfree(conf->nr_waiting);
+	kfree(conf->nr_queued);
+	kfree(conf->barrier);
 	kfree(conf);
 }
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c52ef42..3442e8f 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -1,6 +1,29 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/*
+ * each barrier unit size is 64MB fow now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+/*
+ * In struct r1conf, the following members are related to I/O barrier
+ * buckets,
+ *	int	*nr_pending;
+ *	int	*nr_waiting;
+ *	int	*nr_queued;
+ *	int	*barrier;
+ * Each of them points to array of integers, each array is designed to
+ * have BARRIER_BUCKETS_NR elements and occupy a single memory page. The
+ * data width of integer variables is 4, equal to 1<<(ilog2(sizeof(int))),
+ * BARRIER_BUCKETS_NR_BITS is defined as (PAGE_SHIFT - ilog2(sizeof(int)))
+ * to make sure an array of integers with BARRIER_BUCKETS_NR elements just
+ * exactly occupies a single memory page.
+ */
+#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - ilog2(sizeof(int)))
+#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +58,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +83,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			*nr_pending;
+	int			*nr_waiting;
+	int			*nr_queued;
+	int			*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +139,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;
@@ -185,4 +188,10 @@ enum r1bio_state {
 	R1BIO_WriteError,
 	R1BIO_FailFast,
 };
+
+static inline int sector_to_idx(sector_t sector)
+{
+	return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+			 BARRIER_BUCKETS_NR_BITS);
+}
 #endif
-- 
2.6.6


^ permalink raw reply related

* [PATCH V4 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: colyli @ 2017-02-17 19:05 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1487358357-123924-1-git-send-email-colyli@suse.de>

When I run a parallel reading performan testing on a md raid1 device with
two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
only 2.7GB/s, this is around 50% of the idea performance number.

The perf reports locking contention happens at allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock firstly, even there are only regular
reading I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in I/O barrier code, and only
holding conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provides
comments to improve it. I continue to work on it, and make the patch into
current form.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   Which calls _wait_barrier(), is used for regular write I/O. If there is
   resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, task will wait until no barrier on same barrier bucket,
   or the whold array is unfreezed.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only uptodate data will be read out), it is unnecessary
   to wait for barrier in regular read I/Os, waiting in only necessary
   when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->
nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
raised in same barrier bucket index and array is not frozen, the regular
I/O doesn't need to hold conf->resync_lock, it can just increase
conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(), the only difference is it only waits when
array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
code almostly gets rid of all spin lock cost.

This patch significantly improves raid1 reading peroformance. From my
testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Changelog
V4:
- Change conf->nr_queued[] to atomic_t.
- Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
  READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
  to fix a deadlock between  _wait_barrier()/wait_read_barrier and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no racy when checking two
  atomic_t variables at same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 165 ++++++++++++++++++++++++++++++++++++-----------------
 drivers/md/raid1.h |  31 +++++-----
 2 files changed, 130 insertions(+), 66 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 621b3ae..7376b3d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -226,7 +226,7 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	idx = sector_to_idx(r1_bio->sector);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued[idx]++;
+	atomic_inc(&conf->nr_queued[idx]);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -836,11 +836,21 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
+	atomic_inc(&conf->barrier[idx]);
+	/*
+	 * In raise_barrier() we firstly increase conf->barrier[idx] then
+	 * check conf->nr_pending[idx]. In _wait_barrier() we firstly
+	 * increase conf->nr_pending[idx] then check conf->barrier[idx].
+	 * A memory barrier here to make sure conf->nr_pending[idx] won't
+	 * be fetched before conf->barrier[idx] is increased. Otherwise
+	 * there will be a race between raise_barrier() and _wait_barrier().
+	 */
+	smp_mb__after_atomic();
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
@@ -851,42 +861,81 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			     !conf->nr_pending[idx] &&
-			     conf->barrier[idx] < RESYNC_DEPTH,
+			     !atomic_read(&conf->nr_pending[idx]) &&
+			     atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	int idx = sector_to_idx(sector_nr);
 
-	BUG_ON(conf->barrier[idx] <= 0);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
 
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
 static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket. Also if the array is frozen, I/O
+	 * should be blocked until array is unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	/*
+	 * In _wait_barrier() we firstly increase conf->nr_pending[idx], then
+	 * check conf->barrier[idx]. In raise_barrier() we firstly increase
+	 * conf->barrier[idx], then check conf->nr_pending[idx]. A memory
+	 * barrier is necessary here to make sure conf->barrier[idx] won't be
+	 * fetched before conf->nr_pending[idx] is increased. Otherwise there
+	 * will be a race between _wait_barrier() and raise_barrier().
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * Don't worry about checking two atomic_t variables at same time
+	 * here. If during we check conf->barrier[idx], the array is
+	 * frozen (conf->array_frozen is 1), and chonf->barrier[idx] is
+	 * 0, it is safe to return and make the I/O continue. Because the
+	 * array is frozen, all I/O returned here will eventually complete
+	 * or be queued, no race will happen. See code comment in
+	 * frozen_array().
+	 */
+	if (!READ_ONCE(conf->array_frozen) &&
+	    !atomic_read(&conf->barrier[idx]))
+		return;
 
-	conf->nr_pending[idx]++;
+	/*
+	 * After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrer() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen &&
+			     !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -894,18 +943,32 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	int idx = sector_to_idx(sector_nr);
 
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * Very similar to _wait_barrier(). The difference is, for read
+	 * I/O we don't need wait for sync I/O, but if the whole array
+	 * is frozen, the read I/O still has to wait until the array is
+	 * unfrozen. Since there is no ordering requirement with
+	 * conf->barrier[idx] here, memory barrier is unnecessary as well.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
 
-	conf->nr_pending[idx]++;
+	if (!READ_ONCE(conf->array_frozen))
+		return;
+
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for array to be unfrozen */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen,
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -926,11 +989,7 @@ static void wait_all_barriers(struct r1conf *conf)
 
 static void _allow_barrier(struct r1conf *conf, int idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -955,7 +1014,8 @@ static int get_unqueued_pending(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_pending[idx]) -
+			atomic_read(&conf->nr_queued[idx]);
 
 	return ret;
 }
@@ -1000,8 +1060,8 @@ static void unfreeze_array(struct r1conf *conf)
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 0;
-	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
+	wake_up(&conf->wait_barrier);
 }
 
 /* duplicate the data pages for behind I/O
@@ -2387,8 +2447,13 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
 		idx = sector_to_idx(r1_bio->sector);
-		conf->nr_queued[idx]++;
+		atomic_inc(&conf->nr_queued[idx]);
 		spin_unlock_irq(&conf->device_lock);
+		/*
+		 * In case freeze_array() is waiting for condition
+		 * get_unqueued_pending() == extra to be true.
+		 */
+		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
 		if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2525,9 +2590,7 @@ static void raid1d(struct md_thread *thread)
 						  retry_list);
 			list_del(&r1_bio->retry_list);
 			idx = sector_to_idx(r1_bio->sector);
-			spin_lock_irqsave(&conf->device_lock, flags);
-			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			atomic_dec(&conf->nr_queued[idx]);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2549,7 +2612,7 @@ static void raid1d(struct md_thread *thread)
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
 		idx = sector_to_idx(r1_bio->sector);
-		conf->nr_queued[idx]--;
+		atomic_dec(&conf->nr_queued[idx]);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2666,7 +2729,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2926,22 +2989,22 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_pending)
 		goto abort;
 
 	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_waiting)
 		goto abort;
 
 	conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR,
-				  sizeof(int), GFP_KERNEL);
+				  sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_queued)
 		goto abort;
 
 	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
-				sizeof(int), GFP_KERNEL);
+				sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->barrier)
 		goto abort;
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 3442e8f..dd22a37 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -10,18 +10,19 @@
 /*
  * In struct r1conf, the following members are related to I/O barrier
  * buckets,
- *	int	*nr_pending;
- *	int	*nr_waiting;
- *	int	*nr_queued;
- *	int	*barrier;
- * Each of them points to array of integers, each array is designed to
- * have BARRIER_BUCKETS_NR elements and occupy a single memory page. The
- * data width of integer variables is 4, equal to 1<<(ilog2(sizeof(int))),
- * BARRIER_BUCKETS_NR_BITS is defined as (PAGE_SHIFT - ilog2(sizeof(int)))
- * to make sure an array of integers with BARRIER_BUCKETS_NR elements just
- * exactly occupies a single memory page.
+ *	atomic_t	*nr_pending;
+ *	atomic_t	*nr_waiting;
+ *	atomic_t	*nr_queued;
+ *	atomic_t	*barrier;
+ * Each of them points to array of atomic_t variables, each array is
+ * designed to have BARRIER_BUCKETS_NR elements and occupy a single
+ * memory page. The data width of atomic_t variables is 4 bytes, equal
+ * to 1<<(ilog2(sizeof(atomic_t))), BARRIER_BUCKETS_NR_BITS is defined
+ * as (PAGE_SHIFT - ilog2(sizeof(int))) to make sure an array of
+ * atomic_t variables with BARRIER_BUCKETS_NR elements just exactly
+ * occupies a single memory page.
  */
-#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - ilog2(sizeof(int)))
+#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - ilog2(sizeof(atomic_t)))
 #define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
 
 struct raid1_info {
@@ -83,10 +84,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			*nr_pending;
-	int			*nr_waiting;
-	int			*nr_queued;
-	int			*barrier;
+	atomic_t		*nr_pending;
+	atomic_t		*nr_waiting;
+	atomic_t		*nr_queued;
+	atomic_t		*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
-- 
2.6.6


^ permalink raw reply related

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-17 19:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: colyli, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <87shnevcpr.fsf@notabene.neil.brown.name>

On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:
> 
> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
> > I/O barriers to happen only inside a slidingresync window, for regular
> > I/Os out of this resync window they don't need to wait for barrier any
> > more. On large raid1 device, it helps a lot to improve parallel writing
> > I/O throughput when there are background resync I/Os performing at
> > same time.
> >
> > The idea of sliding resync widow is awesome, but code complexity is a
> > challenge. Sliding resync window requires several veriables to work
> 
> variables
> 
> > collectively, this is complexed and very hard to make it work correctly.
> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> > to fix the original resync window patch. This is not the end, any further
> > related modification may easily introduce more regreassion.
> >
> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
> > removing resync window code, I believe life will be much easier.
> >
> > The brief idea of the simpler barrier is,
> >  - Do not maintain a logbal unique resync window
> 
> global
> 
> >  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
> 
> conflicts
> 
> >    I/O only has to wait for a resync I/O when both them have same barrier
> >    bucket index, vice versa.
> >  - I/O barrier can be recuded to an acceptable number if there are enought
> 
> reduced
> enough
> 
> >    barrier buckets
> >
> > Here I explain how the barrier buckets are designed,
> >  - BARRIER_UNIT_SECTOR_SIZE
> >    The whole LBA address space of a raid1 device is divided into multiple
> >    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
> >    Bio request won't go across border of barrier unit size, that means
> 
> requests
> 
> >    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
> >    For random I/O 64MB is large enough for both read and write requests,
> >    for sequential I/O considering underlying block layer may merge them
> >    into larger requests, 64MB is still good enough.
> >    Neil also points out that for resync operation, "we want the resync to
> >    move from region to region fairly quickly so that the slowness caused
> >    by having to synchronize with the resync is averaged out over a fairly
> >    small time frame". For full speed resync, 64MB should take less then 1
> >    second. When resync is competing with other I/O, it could take up a few
> >    minutes. Therefore 64MB size is fairly good range for resync.
> >
> >  - BARRIER_BUCKETS_NR
> >    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
> >         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
> >         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
> >    this patch makes the bellowed members of struct r1conf from integer
> >    to array of integers,
> >         -       int                     nr_pending;
> >         -       int                     nr_waiting;
> >         -       int                     nr_queued;
> >         -       int                     barrier;
> >         +       int                     *nr_pending;
> >         +       int                     *nr_waiting;
> >         +       int                     *nr_queued;
> >         +       int                     *barrier;
> >    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
> >    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
> >    barrier buckets, and each array of integers occupies single memory page.
> >    1024 means for a request which is smaller than the I/O barrier unit size
> >    has ~0.1% chance to wait for resync to pause, which is quite a small
> >    enough fraction. Also requesting single memory page is more friendly to
> >    kernel page allocator than larger memory size.
> >
> >  - I/O barrier bucket is indexed by bio start sector
> >    If multiple I/O requests hit different I/O barrier units, they only need
> >    to compete I/O barrier with other I/Os which hit the same I/O barrier
> >    bucket index with each other. The index of a barrier bucket which a
> >    bio should look for is calculated by sector_to_idx() which is defined
> >    in raid1.h as an inline function,
> >         static inline int sector_to_idx(sector_t sector)
> >         {
> >                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
> >                                 BARRIER_BUCKETS_NR_BITS);
> >         }
> >    Here sector_nr is the start sector number of a bio.
> 
> "hash_long() is used so that sequential writes in are region of the
> array which is not being resynced will not consistently align with
> the buckets that are being sequentially resynced, as described below"
> 
> >
> >  - Single bio won't go across boundary of a I/O barrier unit
> >    If a request goes across boundary of barrier unit, it will be split. A
> >    bio may be split in raid1_make_request() or raid1_sync_request(), if
> >    sectors returned by align_to_barrier_unit_end() is small than original
> 
> smaller
> 
> >    bio size.
> >
> > Comparing to single sliding resync window,
> >  - Currently resync I/O grows linearly, therefore regular and resync I/O
> >    will have confliction within a single barrier units. So the I/O
> 
> ... will conflict within ...
> 
> >    behavior is similar to single sliding resync window.
> >  - But a barrier unit bucket is shared by all barrier units with identical
> >    barrier uinit index, the probability of confliction might be higher
> >    than single sliding resync window, in condition that writing I/Os
> >    always hit barrier units which have identical barrier bucket indexs with
> >    the resync I/Os. This is a very rare condition in real I/O work loads,
> >    I cannot imagine how it could happen in practice.
> >  - Therefore we can achieve a good enough low confliction rate with much
> 
> ... low conflict rate ...
> 
> >    simpler barrier algorithm and implementation.
> >
> > There are two changes should be noticed,
> >  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
> >    single loop, it looks like this,
> >         spin_lock_irqsave(&conf->device_lock, flags);
> >         conf->nr_queued[idx]--;
> >         spin_unlock_irqrestore(&conf->device_lock, flags);
> >    This change generates more spin lock operations, but in next patch of
> >    this patch set, it will be replaced by a single line code,
> >         atomic_dec(&conf->nr_queueud[idx]);
> >    So we don't need to worry about spin lock cost here.
> >  - Mainline raid1 code split original raid1_make_request() into
> >    raid1_read_request() and raid1_write_request(). If the original bio
> >    goes across an I/O barrier unit size, this bio will be split before
> >    calling raid1_read_request() or raid1_write_request(),  this change
> >    the code logic more simple and clear.
> >  - In this patch wait_barrier() is moved from raid1_make_request() to
> >    raid1_write_request(). In raid_read_request(), original wait_barrier()
> >    is replaced by raid1_read_request().
> >    The differnece is wait_read_barrier() only waits if array is frozen,
> >    using different barrier function in different code path makes the code
> >    more clean and easy to read.
> 
> Thank you for putting the effort into writing a comprehensve change
> description.  I really appreciate it.
> 
> >  
> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> >  
> >  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
> >  {
> > -	struct r1conf *conf = mddev->private;
> > -	struct r1bio *r1_bio;
> > +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> > +	struct bio *split;
> > +	sector_t sectors;
> >  
> > -	/*
> > -	 * make_request() can abort the operation when read-ahead is being
> > -	 * used and no empty request is available.
> > -	 *
> > -	 */
> > -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> > +	make_request_fn = (bio_data_dir(bio) == READ) ?
> > +			  raid1_read_request : raid1_write_request;
> >  
> > -	r1_bio->master_bio = bio;
> > -	r1_bio->sectors = bio_sectors(bio);
> > -	r1_bio->state = 0;
> > -	r1_bio->mddev = mddev;
> > -	r1_bio->sector = bio->bi_iter.bi_sector;
> > -
> > -	/*
> > -	 * We might need to issue multiple reads to different devices if there
> > -	 * are bad blocks around, so we keep track of the number of reads in
> > -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
> > -	 * no locking will be needed when requests complete.  If it is
> > -	 * non-zero, then it is the number of not-completed requests.
> > -	 */
> > -	bio->bi_phys_segments = 0;
> > -	bio_clear_flag(bio, BIO_SEG_VALID);
> > +	/* if bio exceeds barrier unit boundary, split it */
> > +	do {
> > +		sectors = align_to_barrier_unit_end(
> > +				bio->bi_iter.bi_sector, bio_sectors(bio));
> > +		if (sectors < bio_sectors(bio)) {
> > +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> > +			bio_chain(split, bio);
> > +		} else {
> > +			split = bio;
> > +		}
> >  
> > -	if (bio_data_dir(bio) == READ)
> > -		raid1_read_request(mddev, bio, r1_bio);
> > -	else
> > -		raid1_write_request(mddev, bio, r1_bio);
> > +		make_request_fn(mddev, split);
> > +	} while (split != bio);
> >  }
> 
> I know you are going to change this as Shaohua wantsthe spitting
> to happen in a separate function, which I agree with, but there is
> something else wrong here.
> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
> It is OK for simple devices, but when one request can wait for another
> request to the same device it can deadlock.
> This can happen with raid1.  If a resync request calls raise_barrier()
> between one request and the next, then the next has to wait for the
> resync request, which has to wait for the first request.
> As the first request will be stuck in the queue in
> generic_make_request(), you get a deadlock.
> It is much safer to:
> 
>     if (need to split) {
>         split = bio_split(bio, ...)
>         bio_chain(...)
>         make_request_fn(split);
>         generic_make_request(bio);
>    } else
>         make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.

Good point! raid10 has the same problem. Looks this doesn't solve the issue for
device with 3 times stack though.

I knew you guys are working on this issue in block layer. Should we fix the
issue in MD side (for 2 stack devices) or wait for the block layer patch?

Thanks,
Shaohua 

^ permalink raw reply

* Re: [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-17 20:00 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1487358357-123924-1-git-send-email-colyli@suse.de>

On Sat, Feb 18, 2017 at 03:05:56AM +0800, colyli@suse.de wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
> 
> The idea of sliding resync widow is awesome, but code complexity is a
> challenge. Sliding resync window requires several variables to work
> collectively, this is complexed and very hard to make it work correctly.
> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> to fix the original resync window patch. This is not the end, any further
> related modification may easily introduce more regreassion.
> 
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
> 
> The brief idea of the simpler barrier is,
>  - Do not maintain a global unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflicts, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be reduced to an acceptable number if there are enough
>    barrier buckets
> 
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio requests won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>    For random I/O 64MB is large enough for both read and write requests,
>    for sequential I/O considering underlying block layer may merge them
>    into larger requests, 64MB is still good enough.
>    Neil also points out that for resync operation, "we want the resync to
>    move from region to region fairly quickly so that the slowness caused
>    by having to synchronize with the resync is averaged out over a fairly
>    small time frame". For full speed resync, 64MB should take less then 1
>    second. When resync is competing with other I/O, it could take up a few
>    minutes. Therefore 64MB size is fairly good range for resync.
> 
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>    this patch makes the bellowed members of struct r1conf from integer
>    to array of integers,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;
>    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>    barrier buckets, and each array of integers occupies single memory page.
>    1024 means for a request which is smaller than the I/O barrier unit size
>    has ~0.1% chance to wait for resync to pause, which is quite a small
>    enough fraction. Also requesting single memory page is more friendly to
>    kernel page allocator than larger memory size.
> 
>  - I/O barrier bucket is indexed by bio start sector
>    If multiple I/O requests hit different I/O barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same I/O barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by sector_to_idx() which is defined
>    in raid1.h as an inline function,
>         static inline int sector_to_idx(sector_t sector)
>         {
>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>                                 BARRIER_BUCKETS_NR_BITS);
>         }
>    Here sector_nr is the start sector number of a bio.
> 
>  - Single bio won't go across boundary of a I/O barrier unit
>    If a request goes across boundary of barrier unit, it will be split. A
>    bio may be split in raid1_make_request() or raid1_sync_request(), if
>    sectors returned by align_to_barrier_unit_end() is smaller than
>    original bio size.
> 
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will conflict within a single barrier units. So the I/O behavior is
>    similar to single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of conflict might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket indexs with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low conflict rate with much
>    simpler barrier algorithm and implementation.
> 
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(&conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Mainline raid1 code split original raid1_make_request() into
>    raid1_read_request() and raid1_write_request(). If the original bio
>    goes across an I/O barrier unit size, this bio will be split before
>    calling raid1_read_request() or raid1_write_request(),  this change
>    the code logic more simple and clear.
>  - In this patch wait_barrier() is moved from raid1_make_request() to
>    raid1_write_request(). In raid_read_request(), original wait_barrier()
>    is replaced by raid1_read_request().
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.
> Changelog
> V4:
> - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
> - Fix many typos in patch comments.
> - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
> V3:
> - Rebase the patch against latest upstream kernel code.
> - Many fixes by review comments from Neil,
>   - Back to use pointers to replace arraries in struct r1conf
>   - Remove total_barriers from struct r1conf
>   - Add more patch comments to explain how/why the values of
>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>   - Use get_unqueued_pending() to replace get_all_pendings() and
>     get_all_queued()
>   - Increase bucket number from 512 to 1024
> - Change code comments format by review from Shaohua.
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
>   bounday, to make the code more simple, by suggestion from Shaohua and
>   Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
>   confilict between resync I/O and sequential write I/O, by suggestion from
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
> - In V1 patch the bellowed barrier buckets related members in r1conf are
>   allocated in memory page. To make the code more simple, V2 patch moves
>   the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
> V1:
> - Original RFC patch for comments

Thanks, I applied the two patches. For the deadlock issue Neil pointed out, I
don't want it to hold this patch, we can fix it later after we figured out how
to fix. The r1bio allocation is guaranteed to success because it's mempool
allocation, so I deleted the warn statement.

Neil,

Since you reviewed the patches a lot I added your 'reviewed-by'. Please let me know if
you don't want to add it.

Thanks,
Shaohua

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Shaohua Li @ 2017-02-17 20:06 UTC (permalink / raw)
  To: Les Stroud; +Cc: linux-raid
In-Reply-To: <99A92F4D-338D-4486-BB1C-C114A8524403@lesstroud.com>

On Fri, Feb 17, 2017 at 02:05:49PM -0500, Les Stroud wrote:
> 
> I have a problem with processes entering an uninterruptible sleep state in md_flush_request and never returning. I having trouble identifying the underlying issue. I’m hoping someone on here may be able to help.
> 
> The servers in question are running in aws (xen hvm) with kernel 3.8.13-118.16.2.el6uek.x86_64.  The server has two mounts.  The first is vanilla ext4.  The second is a software RAID0 array, striped with 256k chunks, buiIt with md.  It’s file system is ext4. 
> 
> The most immediately and obvious symptom of the issue are kernel errors “kernel: INFO task [some process]: blocked for more than 120 seconds.”.  Shortly there after, other processes start entering the same uninterruptible wait state (D). This almost always impacts ssh logins.
> 
> The problem does not occur when the system is under load, or was recently under load.  It happens when the system is quiet (no cpu, very little io).

This seems suggesting we have a missed blk-plug flush in light workload. Can
you check the output of /sys/block/disk-bame/inflight for both md and the
underlayer disks? This will let us know if there is IO pending.
Also it would be great if you can test a upstream kernel.

Thanks,
Shaohua

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox