Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [PATCH] md/raid1/10: fix potential deadlock
From: 王金浦 @ 2017-03-02 17:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, NeilBrown, Coly Li, v3.14+, only the raid10 part
In-Reply-To: <b5d7363c5bb03ee017d42f09cf16d9e41e0aa8eb.1488315832.git.shli@fb.com>

2017-02-28 22:08 GMT+01:00 Shaohua Li <shli@fb.com>:
> Neil Brown pointed out a potential deadlock in raid 10 code with
> bio_split/chain. The raid1 code could have the same issue, but recent
> barrier rework makes it less likely to happen. The deadlock happens in
> below sequence:
>
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request will split bio to bio1 and bio2
> 3. __make_request(bio1), wait_barrer, add underlayer disk bio to
> current->bio_list
> 4. __make_request(bio2), wait_barrer
>
> If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
> raise_barrier waits for IO completion from 3. And since raise_barrier
> sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
> dispatched because raid10_make_request() doesn't finished yet.
>
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
>
>     if (need to split) {
>         split = bio_split(bio, ...)
>         bio_chain(...)
>         make_request_fn(split);
>         generic_make_request(bio);
>    } else
>         make_request_fn(mddev, bio);
>
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
>
> Cc: Coly Li <colyli@suse.de>
> Cc: 王金浦 <jinpuwang@gmail.com>
> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid1.c  | 25 +++++++++++++++++++++++--
>  drivers/md/raid10.c | 18 ++++++++++++++++++
>  2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 551d654..3c5933b 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1584,9 +1584,30 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>                         split = bio;
>                 }
>
> -               if (bio_data_dir(split) == READ)
> +               if (bio_data_dir(split) == READ) {
>                         raid1_read_request(mddev, split);
> -               else
> +
> +                       /*
> +                        * If a bio is splitted, the first part of bio will
> +                        * pass barrier but the bio is queued in
> +                        * current->bio_list (see generic_make_request). If
> +                        * there is a raise_barrier() called here, the second
> +                        * part of bio can't pass barrier. But since the first
> +                        * part bio isn't dispatched to underlaying disks yet,
> +                        * the barrier is never released, hence raise_barrier
> +                        * will alays wait. We have a deadlock.
> +                        * Note, this only happens in read path. For write
> +                        * path, the first part of bio is dispatched in a
> +                        * schedule() call (because of blk plug) or offloaded
> +                        * to raid10d.
> +                        * Quitting from the function immediately can change
> +                        * the bio order queued in bio_list and avoid the deadlock.
> +                        */
> +                       if (split != bio) {
> +                               generic_make_request(bio);
> +                               break;
> +                       }
> +               } else
>                         raid1_write_request(mddev, split);
>         } while (split != bio);
>  }
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c4db6d1..b1b1f98 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1584,7 +1584,25 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>                         split = bio;
>                 }
>
> +               /*
> +                * If a bio is splitted, the first part of bio will pass
> +                * barrier but the bio is queued in current->bio_list (see
> +                * generic_make_request). If there is a raise_barrier() called
> +                * here, the second part of bio can't pass barrier. But since
> +                * the first part bio isn't dispatched to underlaying disks
> +                * yet, the barrier is never released, hence raise_barrier will
> +                * alays wait. We have a deadlock.
> +                * Note, this only happens in read path. For write path, the
> +                * first part of bio is dispatched in a schedule() call
> +                * (because of blk plug) or offloaded to raid10d.
> +                * Quitting from the function immediately can change the bio
> +                * order queued in bio_list and avoid the deadlock.
> +                */
>                 __make_request(mddev, split);
> +               if (split != bio && bio_data_dir(bio) == READ) {
> +                       generic_make_request(bio);
> +                       break;
> +               }
>         } while (split != bio);
>
>         /* In case raid10d snuck in to freeze_array */
> --
> 2.9.3
>
Looks good to me!
Reviewed-by:  Jack Wang <jinpu.wang@profitbricks.com>

Thanks!

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Shaohua Li @ 2017-03-02 17:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVPqFyvsdwjdcHbKm8J8Ni=d5isSxguRVJZoF3m2kaO98Q@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
> >> Now we allocate one page array for managing resync pages, instead
> >> of using bio's vec table to do that, and the old way is very hacky
> >> and won't work any more if multipage bvec is enabled.
> >>
> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
> >> resync shouldn't be much, as pointed by Shaohua.
> >>
> >> Also the bio_reset() in raid1_sync_request() is removed because
> >> all bios are freshly new now and not necessary to reset any more.
> >>
> >> This patch can be thought as a cleanup too
> >>
> >> Suggested-by: Shaohua Li <shli@kernel.org>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 53 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c442b4657e2f..900144f39630 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
> >>  #define raid1_log(md, fmt, args...)                          \
> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> >>
> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
> >> +{
> >> +     return bio->bi_private;
> >> +}
> >> +
> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
> >> +{
> >> +     return get_resync_pages(bio)->raid_bio;
> >> +}
> >
> > This is a weird between bio, r1bio and the resync_pages. I'd like the pages are
> 
> It is only a bit weird inside allocating and freeing r1bio, once all
> are allocated, you
> can see everthing is clean and simple:
> 
>     - r1bio includes lots of bioes,
>     - and one bio is attached by one resync_pages via .bi_private

I don't how complex to let r1bio pointer to the pages, but that's the nartual
way. r1bio owns the pages, not the pages own r1bio, so we should let r1bio
points to the pages. The bio.bi_private still points to r1bio.

> > embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
> > straightforward.
> 
> In theory, the cleanest way is to allocate one resync_pages for each resync bio,
> but that isn't efficient. That is why this patch allocates all
> resync_pages together
> in r1buf_pool_alloc(), and split them into bio.
> 
> BTW, the only trick is just that the whole page array is stored in .bi_private
> of the 1st bio, then we can avoid to add one pointer into r1bio.

You will need to add pointer in resync_pages which points to r1bio
 
> >
> > I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
> >
> 
> No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
> with reading from the pre-allocated page array.  Both should be fine, but the
> latter is cleaner and simpler to do.

Ok, sounds good, though I doubt it's really worthy.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Shaohua Li @ 2017-03-02 17:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVO3Nk4NgwTyD5eY9fYHAoshbGx=vnTFFQ_VpmED55=Bjg@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
> >> This patch gets each page's reference of each bio for resync,
> >> then r1buf_pool_free() gets simplified a lot.
> >>
> >> The same policy has been taken in raid10's buf pool allocation/free
> >> too.
> >
> > We are going to delete the code, this simplify isn't really required.
> 
> Could you explain it a bit? Or I can put this patch into this patchset if you
> can provide one?

I mean you will replace the code soon, with the resync_pages stuff. So I
thought this isn't really required.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Shaohua Li @ 2017-03-02 17:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVMJ-h+TwKT8eVbg3AxUjBzzEsCV0SeuDUaQ3FvFdfni3w@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
> >> Now resync I/O use bio's bec table to manage pages,
> >> this way is very hacky, and may not work any more
> >> once multipage bvec is introduced.
> >>
> >> So introduce helpers and new data structure for
> >> managing resync I/O pages more cleanly.
> >>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 54 insertions(+)
> >>
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 1d63239a1be4..b5a638d85cb4 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
> >>  #define RESYNC_BLOCK_SIZE (64*1024)
> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
> >>
> >> +/* for managing resync I/O pages */
> >> +struct resync_pages {
> >> +     unsigned        idx;    /* for get/put page from the pool */
> >> +     void            *raid_bio;
> >> +     struct page     *pages[RESYNC_PAGES];
> >> +};
> >
> > I'd like this to be embedded into r1bio directly. Not sure if we really need a
> > structure.
> 
> There are two reasons we can't put this into r1bio:
> - r1bio is used in both normal and resync I/O, not fair to allocate more
> in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio
> 
> - the count of 'struct resync_pages' instance depends on raid_disks(raid1)
> or copies(raid10), both can't be decided during compiling.

Yes, I said it should be a pointer of r1bio pointing to the pages in other emails.
 
> >
> >> +}
> >> +
> >> +static inline bool resync_page_available(struct resync_pages *rp)
> >> +{
> >> +     return rp->idx < RESYNC_PAGES;
> >> +}
> >
> > Then we don't need this obscure API.
> 
> That is fine.
> 
> 
> Thanks,
> Ming Lei
 
> >
> >> +
> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
> >> +                                  gfp_t gfp_flags)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++) {
> >> +             rp->pages[i] = alloc_page(gfp_flags);
> >> +             if (!rp->pages[i])
> >> +                     goto out_free;
> >> +     }
> >> +
> >> +     return 0;
> >> +
> >> + out_free:
> >> +     while (--i >= 0)
> >> +             __free_page(rp->pages[i]);
> >> +     return -ENOMEM;
> >> +}
> >> +
> >> +static inline void resync_free_pages(struct resync_pages *rp)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++)
> >> +             __free_page(rp->pages[i]);
> >
> > Since we will use get_page, shouldn't this be put_page?
> 
> You are right, will fix in v3.
> 
> >
> >> +}
> >> +
> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++)
> >> +             get_page(rp->pages[i]);
> >> +}
> >> +
> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
> >> +{
> >> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
> >> +             return NULL;
> >> +     return rp->pages[rp->idx++];
> >
> > I'd like the caller explicitly specify the index instead of a global variable
> > to track it, which will make the code more understandable and less error prone.
> 
> That is fine, but the index has to be per bio, and finally the index
> has to be stored
> in 'struct resync_pages', so every user has to call it in the following way:
> 
>           resync_fetch_page(rp, rp->idx);
> 
> then looks no benefit to pass index explicitly.

I suppose the callers always iterate though page 0 to RESYNC_PAGES - 1. Then
the callers can use an index by themselves. That will clearly show which page
the callers are using. The resync_fetch_page() wrap is a blackbox we always
need to refer to the definition.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
From: Shaohua Li @ 2017-03-02 18:28 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb
In-Reply-To: <1488357760-7893-3-git-send-email-gqjiang@suse.com>

On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
> Since we have switched to sync way to handle METADATA_UPDATED
> msg for md-cluster, then process_metadata_update is depended
> on mddev->thread->wqueue.
> 
> With the new change, clustered raid could possible hang if
> array received a METADATA_UPDATED msg after array unregistered
> mddev->thread, so we need to stop clustered raid earlier
> than before.
> 
> And this change should be safe for non-clustered raid since
> all writes are stopped before the destroy. Also in md_run,
> we activate the personality (pers->run()) before activating
> the bitmap (bitmap_create()). So it is pleasingly symmetric
> to stop the bitmap (bitmap_destroy()) before stopping the
> personality (__md_stop() calls pers->free()).
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md.c | 30 +++++++++++++++++-------------
>  1 file changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 44206bc6e3aa..e1d9116044ae 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>  	/* stop the array and free an attached data structures.
>  	 * This is called from dm-raid
>  	 */
> -	__md_stop(mddev);
>  	bitmap_destroy(mddev);
> +	__md_stop(mddev);
>  	if (mddev->bio_set)
>  		bioset_free(mddev->bio_set);
>  }

Applied other 4 patches. But this one I still have concerns.

For raid1, if a bio is behind IO, we return the bio to upper layer but don't
wait behind IO completion. So even there are no writes running, there might be
behind IO running. mddev_detach will do the wait which checks bitmap. If we
bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.

Probably we should move mddev_detach out of __md_stop and always do:
mddev_detach()
bitmap_destroy()
__md_stop()

This looks safer to me.

Thanks,
Shaohua
> @@ -5688,6 +5688,22 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  			set_disk_ro(disk, 0);
>  
>  		__md_stop_writes(mddev);
> +
> +		/*
> +		 * Destroy bitmap after all writes are stopped
> +		 */
> +		if (mode == 0) {
> +			bitmap_destroy(mddev);
> +			if (mddev->bitmap_info.file) {
> +				struct file *f = mddev->bitmap_info.file;
> +				spin_lock(&mddev->lock);
> +				mddev->bitmap_info.file = NULL;
> +				spin_unlock(&mddev->lock);
> +				fput(f);
> +			}
> +			mddev->bitmap_info.offset = 0;
> +		}
> +
>  		__md_stop(mddev);
>  		mddev->queue->backing_dev_info->congested_fn = NULL;
>  
> @@ -5712,19 +5728,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  	 */
>  	if (mode == 0) {
>  		pr_info("md: %s stopped.\n", mdname(mddev));
> -
> -		bitmap_destroy(mddev);
> -		if (mddev->bitmap_info.file) {
> -			struct file *f = mddev->bitmap_info.file;
> -			spin_lock(&mddev->lock);
> -			mddev->bitmap_info.file = NULL;
> -			spin_unlock(&mddev->lock);
> -			fput(f);
> -		}
> -		mddev->bitmap_info.offset = 0;
> -
>  		export_array(mddev);
> -
>  		md_clean(mddev);
>  		if (mddev->hold_active == UNTIL_STOP)
>  			mddev->hold_active = 0;
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: raid5 - adding journal to an existing device?
From: Shaohua Li @ 2017-03-02 18:39 UTC (permalink / raw)
  To: Tomasz Chmielewski; +Cc: linux-raid, songliubraving, jes.sorensen
In-Reply-To: <9447cf1f98aaade8ab5cd3f0ef724f79@admin.virtall.com>

On Sat, Feb 25, 2017 at 11:12:21PM +0900, Tomasz Chmielewski wrote:
> mdadm manual specifies the following for create, build and grow modes:
> 
>        --write-journal
>               Specify journal device for the RAID-4/5/6 array.
> 
> However:
> 
> # mdadm --grow /dev/md2 --write-journal /dev/sdc1
> mdadm: :option --write-journal not valid in grow mode
> 
> 
> Is it not yet supported, will never be supported, or am I simply trying to
> use it in a  wrong way?
> 
> 
> 
> Tried the following versions:
> 
> # ./mdadm -V
> mdadm - v3.4-115-ge22fe3a - 29th November 2016
> 
> # mdadm -V
> mdadm - v3.4 - 28th January 2016

Cc: Song and Jes.

I think we support hotadd journal. Will let Song clarify.

Thanks,
Shaohua

^ permalink raw reply

* Re: raid5 - adding journal to an existing device?
From: Song Liu @ 2017-03-02 19:19 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Tomasz Chmielewski, linux-raid@vger.kernel.org,
	jes.sorensen@gmail.com
In-Reply-To: <20170302183925.xnvvybpui6ldssv6@kernel.org>


> On Mar 2, 2017, at 10:39 AM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Sat, Feb 25, 2017 at 11:12:21PM +0900, Tomasz Chmielewski wrote:
>> mdadm manual specifies the following for create, build and grow modes:
>> 
>>       --write-journal
>>              Specify journal device for the RAID-4/5/6 array.
>> 
>> However:
>> 
>> # mdadm --grow /dev/md2 --write-journal /dev/sdc1
>> mdadm: :option --write-journal not valid in grow mode
>> 
>> 
>> Is it not yet supported, will never be supported, or am I simply trying to
>> use it in a  wrong way?
>> 
>> 
>> 
>> Tried the following versions:
>> 
>> # ./mdadm -V
>> mdadm - v3.4-115-ge22fe3a - 29th November 2016
>> 
>> # mdadm -V
>> mdadm - v3.4 - 28th January 2016
> 
> Cc: Song and Jes.
> 
> I think we support hotadd journal. Will let Song clarify.
> 
> Thanks,
> Shaohua

We have support for hotadd journal to an array with broken/missing journal. 
We haven't implement hotted journal to arrays with no journal ever. We do plan
to add the feature. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
From: NeilBrown @ 2017-03-02 22:15 UTC (permalink / raw)
  To: Shaohua Li, Guoqing Jiang; +Cc: linux-raid, shli, neilb
In-Reply-To: <20170302182825.fcinhanfai3qgzot@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]

On Thu, Mar 02 2017, Shaohua Li wrote:

> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>> Since we have switched to sync way to handle METADATA_UPDATED
>> msg for md-cluster, then process_metadata_update is depended
>> on mddev->thread->wqueue.
>> 
>> With the new change, clustered raid could possible hang if
>> array received a METADATA_UPDATED msg after array unregistered
>> mddev->thread, so we need to stop clustered raid earlier
>> than before.
>> 
>> And this change should be safe for non-clustered raid since
>> all writes are stopped before the destroy. Also in md_run,
>> we activate the personality (pers->run()) before activating
>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>> to stop the bitmap (bitmap_destroy()) before stopping the
>> personality (__md_stop() calls pers->free()).
>> 
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  drivers/md/md.c | 30 +++++++++++++++++-------------
>>  1 file changed, 17 insertions(+), 13 deletions(-)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 44206bc6e3aa..e1d9116044ae 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>  	/* stop the array and free an attached data structures.
>>  	 * This is called from dm-raid
>>  	 */
>> -	__md_stop(mddev);
>>  	bitmap_destroy(mddev);
>> +	__md_stop(mddev);
>>  	if (mddev->bio_set)
>>  		bioset_free(mddev->bio_set);
>>  }
>
> Applied other 4 patches. But this one I still have concerns.
>
> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
> wait behind IO completion. So even there are no writes running, there might be
> behind IO running. mddev_detach will do the wait which checks bitmap. If we
> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.
>
> Probably we should move mddev_detach out of __md_stop and always do:
> mddev_detach()
> bitmap_destroy()
> __md_stop()
>
> This looks safer to me.

Thanks for catching that.  I agree - mddev_detach should come before
bitmap_destroy.
I might be best to change __md_stop() to start

static void __md_stop(struct mddev *mddev)
{
	struct md_personality *pers = mddev->pers;
	mddev_detach(mddev);
+	bitmap_destroy(mddev);
	/* Ensure ->event_work is done */
	flush_workqueue(md_misc_wq);


That would make the remainder of the patch (below) unnecessary.
I think it is wrong anyway.
We were correct to move the "bitmap_destroy() call up to the
"if (mddev->pers)" case, but we were not correct to move the
closing of bitmap_info.file.
If a file is added to an array but that array is not started
(so ->pers is not set), then stopping the array will not close the file
with the change below, and that isn't good.

So if we move bitmap_destroy() into  __md_stop() and remove it from
do_md_stop and md_stop(), that might be exactly what we need.

Thanks,
NeilBrown

>
> Thanks,
> Shaohua
>> @@ -5688,6 +5688,22 @@ static int do_md_stop(struct mddev *mddev, int mode,
>>  			set_disk_ro(disk, 0);
>>  
>>  		__md_stop_writes(mddev);
>> +
>> +		/*
>> +		 * Destroy bitmap after all writes are stopped
>> +		 */
>> +		if (mode == 0) {
>> +			bitmap_destroy(mddev);
>> +			if (mddev->bitmap_info.file) {
>> +				struct file *f = mddev->bitmap_info.file;
>> +				spin_lock(&mddev->lock);
>> +				mddev->bitmap_info.file = NULL;
>> +				spin_unlock(&mddev->lock);
>> +				fput(f);
>> +			}
>> +			mddev->bitmap_info.offset = 0;
>> +		}
>> +
>>  		__md_stop(mddev);
>>  		mddev->queue->backing_dev_info->congested_fn = NULL;
>>  
>> @@ -5712,19 +5728,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>>  	 */
>>  	if (mode == 0) {
>>  		pr_info("md: %s stopped.\n", mdname(mddev));
>> -
>> -		bitmap_destroy(mddev);
>> -		if (mddev->bitmap_info.file) {
>> -			struct file *f = mddev->bitmap_info.file;
>> -			spin_lock(&mddev->lock);
>> -			mddev->bitmap_info.file = NULL;
>> -			spin_unlock(&mddev->lock);
>> -			fput(f);
>> -		}
>> -		mddev->bitmap_info.offset = 0;
>> -
>>  		export_array(mddev);
>> -
>>  		md_clean(mddev);
>>  		if (mddev->hold_active == UNTIL_STOP)
>>  			mddev->hold_active = 0;
>> -- 
>> 2.6.2
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH RFC] test: revise 'test' and make it easier to understand
From: NeilBrown @ 2017-03-02 22:26 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, gqjiang, Zhilong Liu
In-Reply-To: <1488250053-10436-1-git-send-email-zlliu@suse.com>

[-- Attachment #1: Type: text/plain, Size: 1674 bytes --]

On Tue, Feb 28 2017, Zhilong Liu wrote:

> 1. use 'Tab' as the code style.
> 2. arrange the testing steps and provide the 'main' entrance.
> 3. draft the log_save feature, it captures the /proc/mdstat,
>    md superblock info, bitmap info and the detail dmesg.
> 4. modified the mdadm() func, adding the operation that clear
>    the superblock when create or build one new array, and it
>    would exit testing when mdadm command returned non-0 value.
> 5. delete no_errors() func, it only used in tests/04update-uuid,
>    I recommend the new mdadm() using method.
> 6. delete fast_sync() func.
> 7. testdev(), add the object file checking, otherwise this command
>    would create one regular file, it's one trouble thing.
> 8. add dmesg checking in do_test() func, it's necessary to check
>    dmesg whether or not printed abnormal message.
> 9. add checking conditions in main(), such as $pwd/raid6check need
>    exists, here is a prompt to remind users to 'make everything'
>    before testing; the $targetdir should mount under ext[2-4] FS,
>    because the external bitmap only supports ext, the bmap() API
>    of bitmap.c doesn't exist in all filesystem, such as btrfs.
>
> Signed-off-by: Zhilong Liu <zlliu@suse.com>

I haven't looked at this patch in any great detail, but I just wanted to
say I think it is great that 'test' is getting some attention like this.
I sort of just grew over the years without any clear plan.  It probably
has all sorts of strange things in it that don't make a lot of sense any
more.
So I've very happy for you to rip it apart and put it together is a more
coherent form.  Thanks!

Acked-by: NeilBrown <neilb@suse.com>

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* RE: GRUB warning after replacing disk drive in RAID1
From: Peter Sangas @ 2017-03-02 23:00 UTC (permalink / raw)
  To: 'Phil Turmel', 'Reindl Harald', linux-raid
In-Reply-To: <ce5ec33c-c38b-de76-1976-bea9e7caba68@turmel.org>

> Correct.  But it shouldn't be able to boot from the others either, with grub1.
> Something has changed in your grub install since you set up the original drives.  Only
> grub2 would be able to boot from your v1.2 md0.
> 
> { Sorry.  I don't know how to fix grub2's md module. }
> 

Just to so I'm crystal clear : the server needs to have grub2 installed so doesn't this confirm grub2 is installed.

grub-install -V
grub-install (GRUB) 2.02~beta2-36ubuntu3.2

and since I'm running Ubuntu 16.04.1 LTS

from https://help.ubuntu.com/community/Grub2
"GRUB 2 is the default boot loader and manager for Ubuntu since version 9.10"
"GRUB 2 is version 1.98 or later"

What Am I missing?


^ permalink raw reply

* [PATCH - mdadm] examine: tidy up some code.
From: NeilBrown @ 2017-03-02 23:57 UTC (permalink / raw)
  To: Jes.Sorensen; +Cc: linux-raid, Michael Shigorin

[-- Attachment #1: Type: text/plain, Size: 3930 bytes --]


Michael Shigorin reports that the 'lcc' compiler isn't able
to deduce that 'st' must be initialized in

		if (c->SparcAdjust)
			st->ss->update_super(st, NULL, "sparc2.2",

just because the only times it isn't initialised, 'err' is set non-zero.

This results in a 'possibly uninitialised' warning.
While there is no bug in the code, this does suggest that maybe
the code could be made more obviously correct.

So this patch:
 1/ moves the "err" variable inside the for loop, so an error in
    one device doesn't stop the other devices from being processed
 2/ calls 'continue' early if the device cannot be opened, so that
    a level of indent can be removed, and so that it is clear that
    'st' is always initialised before being used
 3/ frees 'st' if an error occured in load_super or load_container.

Reported-by: Michael Shigorin <mike@altlinux.org>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 Examine.c | 75 +++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/Examine.c b/Examine.c
index 953b8eee2360..7013480d6dd8 100644
--- a/Examine.c
+++ b/Examine.c
@@ -53,7 +53,6 @@ int Examine(struct mddev_dev *devlist,
 	 */
 	int fd;
 	int rv = 0;
-	int err = 0;
 
 	struct array {
 		struct supertype *st;
@@ -66,6 +65,8 @@ int Examine(struct mddev_dev *devlist,
 	for (; devlist ; devlist = devlist->next) {
 		struct supertype *st;
 		int have_container = 0;
+		int err = 0;
+		int container = 0;
 
 		fd = dev_open(devlist->devname, O_RDONLY);
 		if (fd < 0) {
@@ -74,44 +75,46 @@ int Examine(struct mddev_dev *devlist,
 				       devlist->devname, strerror(errno));
 				rv = 1;
 			}
-			err = 1;
+			continue;
 		}
-		else {
-			int container = 0;
-			if (forcest)
-				st = dup_super(forcest);
-			else if (must_be_container(fd)) {
-				/* might be a container */
-				st = super_by_fd(fd, NULL);
-				container = 1;
-			} else
-				st = guess_super(fd);
-			if (st) {
-				err = 1;
-				st->ignore_hw_compat = 1;
-				if (!container)
-					err = st->ss->load_super(st, fd,
-								 (c->brief||c->scan) ? NULL
-								 :devlist->devname);
-				if (err && st->ss->load_container) {
-					err = st->ss->load_container(st, fd,
-								 (c->brief||c->scan) ? NULL
-								 :devlist->devname);
-					if (!err)
-						have_container = 1;
-				}
-				st->ignore_hw_compat = 0;
-			} else {
-				if (!c->brief) {
-					pr_err("No md superblock detected on %s.\n", devlist->devname);
-					rv = 1;
-				}
-				err = 1;
+
+		if (forcest)
+			st = dup_super(forcest);
+		else if (must_be_container(fd)) {
+			/* might be a container */
+			st = super_by_fd(fd, NULL);
+			container = 1;
+		} else
+			st = guess_super(fd);
+		if (st) {
+			err = 1;
+			st->ignore_hw_compat = 1;
+			if (!container)
+				err = st->ss->load_super(st, fd,
+							 (c->brief||c->scan) ? NULL
+							 :devlist->devname);
+			if (err && st->ss->load_container) {
+				err = st->ss->load_container(st, fd,
+							     (c->brief||c->scan) ? NULL
+							     :devlist->devname);
+				if (!err)
+					have_container = 1;
 			}
-			close(fd);
+			st->ignore_hw_compat = 0;
+		} else {
+			if (!c->brief) {
+				pr_err("No md superblock detected on %s.\n", devlist->devname);
+				rv = 1;
+			}
+			err = 1;
 		}
-		if (err)
+		close(fd);
+
+		if (err) {
+			if (st)
+				st->ss->free_super(st);
 			continue;
+		}
 
 		if (c->SparcAdjust)
 			st->ss->update_super(st, NULL, "sparc2.2",
@@ -121,7 +124,7 @@ int Examine(struct mddev_dev *devlist,
 		if (c->brief && st->ss->brief_examine_super == NULL) {
 			if (!c->scan)
 				pr_err("No brief listing for %s on %s\n",
-					st->ss->name, devlist->devname);
+				       st->ss->name, devlist->devname);
 		} else if (c->brief) {
 			struct array *ap;
 			char *d;
-- 
2.11.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related

* Re: Reshape stalled at first badblock location (was: RAID 5 --assemble doesn't recognize all overlays as component devices)
From: NeilBrown @ 2017-03-03  0:27 UTC (permalink / raw)
  To: Shaohua Li, George Rapp; +Cc: Linux-RAID, Matthew Krumwiede, Jes.Sorensen
In-Reply-To: <20170221175801.wt64t2tzcvg3sfmc@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 916 bytes --]

On Tue, Feb 21 2017, Shaohua Li wrote:

>
> Add Neil and Jes.
>
> Yes, there were similar reports before. When reshape finds nadblocks, the
> reshape will do an infinite loop without any progress. I think there are two
> things we need to do:
>
> - Make reshape more robust. Maybe reshape should bail out if badblocks found.
> - Add an option in mdadm to force reset badblocks

The second of these is already possible
Commit: 6dd16dac4001 ("Add --update=force-no-bbl.")

It isn't documented though, and only works during "assemble", not on an
active array.
Writing to the "bad" blocks should remove the "bad" status.  It would be
nice if mdadm could locate the bad blocks, map them to array blocks,
trigger a limited "resync" if there are any good copies, or write zeros
if there aren't.

And yes; reshape should be more robust...  if only we had a pool of
developers, eager to work on these problems :-)

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
From: NeilBrown @ 2017-03-03  0:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig
In-Reply-To: <CACVXFVPf6qoaEWcv_s4n=2V9E=Jqfv3QE+ZpXVALdiiWTjh63Q@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2165 bytes --]

On Tue, Feb 21 2017, Ming Lei wrote:

> On Tue, Feb 21, 2017 at 8:05 AM, NeilBrown <neilb@suse.com> wrote:
>> On Mon, Feb 20 2017, Ming Lei wrote:
>>
>>> On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
>>>>
>>>> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
>>>> +                       sector_t bi_sector)
>>>> +{
>>>> +       /* The current request requires multiple r1_bio, so
>>>> +        * we need to increment the pending count, and the corresponding
>>>> +        * window count.
>>>> +        */
>>>> +       spin_lock(&conf->resync_lock);
>>>> +       conf->nr_pending++;
>>>
>>> Just be curious, in current code 'nr_pending' is increased in wait_barrier(),
>>> and looks this patch introduces inc_pending() to do that on each r10_bio, but
>>> not see any change in wait_barrier(), so that means there might be issue in
>>> current implementation about operating on this counter?
>>
>> Did you read the more detailed description in the previous raid10.c
>> patch?
>> This patch follows the same logic as that patch.
>
> OK, I see the point now:
>
> - for the 1st r1_bio, conf->nr_pending is increased in wait_barrier()
> - for the others, conf->nr_pending is increased in inc_pending().
>
> Also I have another question:
>
> - before this patch, both number of requests in windows
> are increased only for WRITE I/O(see wait_barrier()), and decreased
> for both READ/WRITE in complete path(see allow_barrier())

For a READ request, ->start_next_window is zero, so allow_barrier()
doesn't decrease the window counters.
So they are only increased and decreased for WRITE request.

>
> - after this patch, except for the 1st r1_bio, number of requests in
> windows are increased for both WRITE/READ I/O, and decreased
> for both READ/WRITE too.

Why do you think READ requests now increase the number of requests in a
window?

NeilBrown

>
> Could you explain a bit about this change?
>
> Thanks,
> Ming Lei
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Ming Lei @ 2017-03-03  1:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302175543.ta2tmjveatjiiapc@kernel.org>

On Fri, Mar 3, 2017 at 1:55 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> >> Now resync I/O use bio's bec table to manage pages,
>> >> this way is very hacky, and may not work any more
>> >> once multipage bvec is introduced.
>> >>
>> >> So introduce helpers and new data structure for
>> >> managing resync I/O pages more cleanly.
>> >>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 54 insertions(+)
>> >>
>> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> >> index 1d63239a1be4..b5a638d85cb4 100644
>> >> --- a/drivers/md/md.h
>> >> +++ b/drivers/md/md.h
>> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
>> >>  #define RESYNC_BLOCK_SIZE (64*1024)
>> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>> >>
>> >> +/* for managing resync I/O pages */
>> >> +struct resync_pages {
>> >> +     unsigned        idx;    /* for get/put page from the pool */
>> >> +     void            *raid_bio;
>> >> +     struct page     *pages[RESYNC_PAGES];
>> >> +};
>> >
>> > I'd like this to be embedded into r1bio directly. Not sure if we really need a
>> > structure.
>>
>> There are two reasons we can't put this into r1bio:
>> - r1bio is used in both normal and resync I/O, not fair to allocate more
>> in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio
>>
>> - the count of 'struct resync_pages' instance depends on raid_disks(raid1)
>> or copies(raid10), both can't be decided during compiling.
>
> Yes, I said it should be a pointer of r1bio pointing to the pages in other emails.

OK, got it, :-)

Actually I did that in my local tree, and finally I figured out not
necessary to introduce this pointor, which only usage is for freeing,
and we can get the
pointor from .bi_private of the 1st bio. And this stuff can be commented
clearly.

>
>> >
>> >> +}
>> >> +
>> >> +static inline bool resync_page_available(struct resync_pages *rp)
>> >> +{
>> >> +     return rp->idx < RESYNC_PAGES;
>> >> +}
>> >
>> > Then we don't need this obscure API.
>>
>> That is fine.
>>
>>
>> Thanks,
>> Ming Lei
>
>> >
>> >> +
>> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> >> +                                  gfp_t gfp_flags)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++) {
>> >> +             rp->pages[i] = alloc_page(gfp_flags);
>> >> +             if (!rp->pages[i])
>> >> +                     goto out_free;
>> >> +     }
>> >> +
>> >> +     return 0;
>> >> +
>> >> + out_free:
>> >> +     while (--i >= 0)
>> >> +             __free_page(rp->pages[i]);
>> >> +     return -ENOMEM;
>> >> +}
>> >> +
>> >> +static inline void resync_free_pages(struct resync_pages *rp)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++)
>> >> +             __free_page(rp->pages[i]);
>> >
>> > Since we will use get_page, shouldn't this be put_page?
>>
>> You are right, will fix in v3.
>>
>> >
>> >> +}
>> >> +
>> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++)
>> >> +             get_page(rp->pages[i]);
>> >> +}
>> >> +
>> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> >> +{
>> >> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> >> +             return NULL;
>> >> +     return rp->pages[rp->idx++];
>> >
>> > I'd like the caller explicitly specify the index instead of a global variable
>> > to track it, which will make the code more understandable and less error prone.
>>
>> That is fine, but the index has to be per bio, and finally the index
>> has to be stored
>> in 'struct resync_pages', so every user has to call it in the following way:
>>
>>           resync_fetch_page(rp, rp->idx);
>>
>> then looks no benefit to pass index explicitly.
>
> I suppose the callers always iterate though page 0 to RESYNC_PAGES - 1. Then
> the callers can use an index by themselves. That will clearly show which page
> the callers are using. The resync_fetch_page() wrap is a blackbox we always
> need to refer to the definition.

OK, understood.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Ming Lei @ 2017-03-03  1:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302174942.ad3lsrqgvi2erkrl@kernel.org>

On Fri, Mar 3, 2017 at 1:49 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> >> This patch gets each page's reference of each bio for resync,
>> >> then r1buf_pool_free() gets simplified a lot.
>> >>
>> >> The same policy has been taken in raid10's buf pool allocation/free
>> >> too.
>> >
>> > We are going to delete the code, this simplify isn't really required.
>>
>> Could you explain it a bit? Or I can put this patch into this patchset if you
>> can provide one?
>
> I mean you will replace the code soon, with the resync_pages stuff. So I
> thought this isn't really required.

OK,  one benefit of this patch is to make the following one easier to review,
but not a big deal, I may merge this into the following patch.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-03  2:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302174803.ds45hdbzkofl5hkh@kernel.org>

On Fri, Mar 3, 2017 at 1:48 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> >> Now we allocate one page array for managing resync pages, instead
>> >> of using bio's vec table to do that, and the old way is very hacky
>> >> and won't work any more if multipage bvec is enabled.
>> >>
>> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
>> >> resync shouldn't be much, as pointed by Shaohua.
>> >>
>> >> Also the bio_reset() in raid1_sync_request() is removed because
>> >> all bios are freshly new now and not necessary to reset any more.
>> >>
>> >> This patch can be thought as a cleanup too
>> >>
>> >> Suggested-by: Shaohua Li <shli@kernel.org>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
>> >>  1 file changed, 53 insertions(+), 30 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> >> index c442b4657e2f..900144f39630 100644
>> >> --- a/drivers/md/raid1.c
>> >> +++ b/drivers/md/raid1.c
>> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>> >>  #define raid1_log(md, fmt, args...)                          \
>> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
>> >>
>> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> >> +{
>> >> +     return bio->bi_private;
>> >> +}
>> >> +
>> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> >> +{
>> >> +     return get_resync_pages(bio)->raid_bio;
>> >> +}
>> >
>> > This is a weird between bio, r1bio and the resync_pages. I'd like the pages are
>>
>> It is only a bit weird inside allocating and freeing r1bio, once all
>> are allocated, you
>> can see everthing is clean and simple:
>>
>>     - r1bio includes lots of bioes,
>>     - and one bio is attached by one resync_pages via .bi_private
>
> I don't how complex to let r1bio pointer to the pages, but that's the nartual
> way. r1bio owns the pages, not the pages own r1bio, so we should let r1bio
> points to the pages. The bio.bi_private still points to r1bio.

Actually it is bio which owns the pages for doing its own I/O, and the only
thing related with r10bio is that bios may share these pages, but using
page refcount trick will make the relation quite implicit.

The only reason to allocate all resync_pages together is for sake of efficiency,
and just for avoiding to allocate one resync_pages one time for each bio.

We have to make .bi_private point to resync_pages(per bio), otherwise we
can't fetch pages into one bio at all, thinking about where to store the index
for each bio's pre-allocated pages, and it has to be per bio.

>
>> > embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
>> > straightforward.
>>
>> In theory, the cleanest way is to allocate one resync_pages for each resync bio,
>> but that isn't efficient. That is why this patch allocates all
>> resync_pages together
>> in r1buf_pool_alloc(), and split them into bio.
>>
>> BTW, the only trick is just that the whole page array is stored in .bi_private
>> of the 1st bio, then we can avoid to add one pointer into r1bio.
>
> You will need to add pointer in resync_pages which points to r1bio

Yeah, that is just what this patchset is doing.

>
>> >
>> > I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
>> >
>>
>> No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
>> with reading from the pre-allocated page array.  Both should be fine, but the
>> latter is cleaner and simpler to do.
>
> Ok, sounds good, though I doubt it's really worthy.

Small patch with one purpose is always easy to review, do one thing,
do it better,
:-)


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-03  2:20 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302075202.ipnx64sodjzxisqs@kernel.org>

On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> >> Use this helper, instead of direct access to .bi_vcnt.
>> >
>> > what We really need to do for the behind IO is:
>> > - allocate memory and copy bio data to the memory
>> > - let behind bio do IO against the memory
>> >
>> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
>> > just track the new memory, we don't need use the bio_segments_all and access
>> > bio_vec too.
>>
>> But we need to figure out how many vecs(each vec store one page) to be
>> allocated for the cloned/behind bio, and that is the only value of
>> bio_segments_all() here. Or you have idea to avoid that?
>
> As I said, the behind bio doesn't need to have the exactly same bio_vec
> setting. We just allocate memory and copy original bio data to the memory,
> then do IO against the new memory. The behind bio
> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

The equation isn't always correct, especially when bvec includes just
part of page, and it is quite often in case of mkfs, in which one bvec often
includes 512byte buffer.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Ming Lei @ 2017-03-03  2:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302074753.6zsdspg5e67l24xu@kernel.org>

On Thu, Mar 2, 2017 at 3:47 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> >> The cost is 128bytes(8*16) stack space in kernel thread context, and
>> >> just use the bio helper to retrieve pages from bio.
>> >>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/raid10.c | 12 ++++++++++--
>> >>  1 file changed, 10 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> >> index 0b97631e3905..6ffb64ab45f8 100644
>> >> --- a/drivers/md/raid10.c
>> >> +++ b/drivers/md/raid10.c
>> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
>> >>       struct r10bio *r10b = &on_stack.r10_bio;
>> >>       int slot = 0;
>> >>       int idx = 0;
>> >> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> >> +     struct bio_vec *bvl;
>> >> +     struct page *pages[RESYNC_PAGES];
>> >> +
>> >> +     /*
>> >> +      * This bio is allocated in reshape_request(), and size
>> >> +      * is still RESYNC_PAGES
>> >> +      */
>> >> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> >> +             pages[idx] = bvl->bv_page;
>> >
>> > The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
>> > wondering why we can't get the pages from r10_bio. In this way, we don't need
>> > access the bio_vec any more.
>>
>> Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
>> .r10buf_pool, please see reshape_request().
>
> Why does this matter? The bio is still doing IO to the memory allocated in
> r10_bio. bi_private->r10_bio->the pages.

Good question, :-)

>
> My guess why you think the reshape read is special is the bio->bi_private
> doesn't pointer to resync_pages. And since you let resync_pages pointer to
> r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
> another reason why r10_bio embeds resync_pages (I mean a pointer to
> resync_pages). With this way I suggested, the reshape read isn't special at

We still can get the resync_pages for r10_bio->master_bio via
r10_bio->devs[0].bio in handle_reshape_read_error(), will do that in v3.

> all. If we allocate memory for r10_bio/r1_bio, we shouldn't need access any bio
> bvec.

Right.



Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH 2/3] md/raid5-cache: bump flush stripe batch size
From: NeilBrown @ 2017-03-03  3:03 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: songliubraving, kernel-team
In-Reply-To: <52a5e6f8924b9a19d4ec4ad73f6e7087f36f139d.1487373517.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 1466 bytes --]

On Fri, Feb 17 2017, Shaohua Li wrote:

> Bump the flush stripe batch size to 2048. For my 12 disks raid
> array, the stripes takes:
> 12 * 4k * 2048 = 96MB
>
> This is still quite small. A hardware raid card generally has 1GB size,
> which we suggest the raid5-cache has similar cache size.
>
> The advantage of a big batch size is we can dispatch a lot of IO in the
> same time, then we can do some scheduling to make better IO pattern.
>
> Last patch prioritizes stripes, so we don't worry about a big flush
> stripe batch will starve normal stripes.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5-cache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 3f307be..b25512c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -43,7 +43,7 @@
>  /* wake up reclaim thread periodically */
>  #define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
>  /* start flush with these full stripes */
> -#define R5C_FULL_STRIPE_FLUSH_BATCH 256
> +#define R5C_FULL_STRIPE_FLUSH_BATCH 2048

Fixed numbers are warning signs... I wonder if there is something better
we could do?   "conf->max_nr_stripes / 4" maybe?  We use that sort of
number elsewhere.
Would that make sense?

Thanks,
NeilBrown

>  /* reclaim stripes in groups */
>  #define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)
>  
> -- 
> 2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
From: Guoqing Jiang @ 2017-03-03  3:08 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li; +Cc: linux-raid, shli, neilb
In-Reply-To: <87pohznx49.fsf@notabene.neil.brown.name>



On 03/03/2017 06:15 AM, NeilBrown wrote:
> On Thu, Mar 02 2017, Shaohua Li wrote:
>
>> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>>> Since we have switched to sync way to handle METADATA_UPDATED
>>> msg for md-cluster, then process_metadata_update is depended
>>> on mddev->thread->wqueue.
>>>
>>> With the new change, clustered raid could possible hang if
>>> array received a METADATA_UPDATED msg after array unregistered
>>> mddev->thread, so we need to stop clustered raid earlier
>>> than before.
>>>
>>> And this change should be safe for non-clustered raid since
>>> all writes are stopped before the destroy. Also in md_run,
>>> we activate the personality (pers->run()) before activating
>>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>>> to stop the bitmap (bitmap_destroy()) before stopping the
>>> personality (__md_stop() calls pers->free()).
>>>
>>> Reviewed-by: NeilBrown <neilb@suse.com>
>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>> ---
>>>   drivers/md/md.c | 30 +++++++++++++++++-------------
>>>   1 file changed, 17 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 44206bc6e3aa..e1d9116044ae 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>>   	/* stop the array and free an attached data structures.
>>>   	 * This is called from dm-raid
>>>   	 */
>>> -	__md_stop(mddev);
>>>   	bitmap_destroy(mddev);
>>> +	__md_stop(mddev);
>>>   	if (mddev->bio_set)
>>>   		bioset_free(mddev->bio_set);
>>>   }
>> Applied other 4 patches. But this one I still have concerns.
>>
>> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
>> wait behind IO completion. So even there are no writes running, there might be
>> behind IO running. mddev_detach will do the wait which checks bitmap. If we
>> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.

Got it, thanks for explanation!

>>
>> Probably we should move mddev_detach out of __md_stop and always do:
>> mddev_detach()
>> bitmap_destroy()
>> __md_stop()
>>
>> This looks safer to me.
> Thanks for catching that.  I agree - mddev_detach should come before
> bitmap_destroy.
> I might be best to change __md_stop() to start
>
> static void __md_stop(struct mddev *mddev)
> {
> 	struct md_personality *pers = mddev->pers;
> 	mddev_detach(mddev);
> +	bitmap_destroy(mddev);
> 	/* Ensure ->event_work is done */
> 	flush_workqueue(md_misc_wq);

With this change, does it mean we destroy bitmap unconditionally even
if the mode is 2? Thanks.

> That would make the remainder of the patch (below) unnecessary.
> I think it is wrong anyway.
> We were correct to move the "bitmap_destroy() call up to the
> "if (mddev->pers)" case, but we were not correct to move the
> closing of bitmap_info.file.
> If a file is added to an array but that array is not started
> (so ->pers is not set), then stopping the array will not close the file
> with the change below, and that isn't good.

Hmm, thanks for reminder. Though I have some questions about above.
I guess the file is pointed to bitmap file, from mdadm manpage, I see
  "-b, --bitmap=" is used under create, build, grow and assemble mode.
But  how to add a file to a array which is not started? Please correct me
for my wrong understanding.

> So if we move bitmap_destroy() into  __md_stop() and remove it from
> do_md_stop and md_stop(), that might be exactly what we need.

How about the below changes? It may addresses both your concern and
Shaohua's concern, but not sure ...

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 79a99a1..a0e79d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5555,7 +5555,6 @@ static void mddev_detach(struct mddev *mddev)
  static void __md_stop(struct mddev *mddev)
  {
         struct md_personality *pers = mddev->pers;
-       mddev_detach(mddev);
         /* Ensure ->event_work is done */
         flush_workqueue(md_misc_wq);
         spin_lock(&mddev->lock);
@@ -5574,8 +5573,9 @@ void md_stop(struct mddev *mddev)
         /* stop the array and free an attached data structures.
          * This is called from dm-raid
          */
-       __md_stop(mddev);
+       mddev_detach(mddev);
         bitmap_destroy(mddev);
+       __md_stop(mddev);
         if (mddev->bio_set)
                 bioset_free(mddev->bio_set);
  }
@@ -5688,6 +5688,10 @@ static int do_md_stop(struct mddev *mddev, int mode,
                         set_disk_ro(disk, 0);

                 __md_stop_writes(mddev);
+               mddev_detach(mddev);
+               if (mode == 0)
+                       bitmap_destroy(mddev);
+
                 __md_stop(mddev);
mddev->queue->backing_dev_info->congested_fn = NULL;

@@ -5713,7 +5717,8 @@ static int do_md_stop(struct mddev *mddev, int mode,
         if (mode == 0) {
                 pr_info("md: %s stopped.\n", mdname(mddev));

-               bitmap_destroy(mddev);
+               if (!mddev->pers)
+                       bitmap_destroy(mddev);
                 if (mddev->bitmap_info.file) {
                         struct file *f = mddev->bitmap_info.file;
                         spin_lock(&mddev->lock);

Thanks,
Guoqing

^ permalink raw reply related

* Re: [PATCH 3/3] md/raid5: sort bios
From: NeilBrown @ 2017-03-03  3:43 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: songliubraving, kernel-team
In-Reply-To: <01c6067566d787a815fc29aa07b36ad4dd0280f0.1487373517.git.shli@fb.com>

[-- Attachment #1: Type: text/plain, Size: 10131 bytes --]

On Fri, Feb 17 2017, Shaohua Li wrote:

> Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
> defers IO dispatching. The goal is to create better IO pattern. At that
> time, we don't sort the deffered IO and hope the block layer can do IO
> merge and sort. Now the raid5-cache writeback could create large amount
> of bios. And if we enable muti-thread for stripe handling, we can't
> control when to dispatch IO to raid disks. In a lot of time, we are
> dispatching IO which block layer can't do merge effectively.
>
> This patch moves further for the IO dispatching defer. We accumulate
> bios, but we don't dispatch all the bios after a threshold is met. This
> 'dispatch partial portion of bios' stragety allows bios coming in a
> large time window are sent to disks together. At the dispatching time,
> there is large chance the block layer can merge the bios. To make this
> more effective, we dispatch IO in ascending order. This increases
> request merge chance and reduces disk seek.

I can see the benefit of batching and sorting requests.

I wonder if the extra complexity of grouping together 512 requests, then
submitting the "first" 128 is really worth it.  Have you measured the
value of that?
If you just submitted every time you got 512 requests, you could use
list_sort() on the bio list and wouldn't need an array.

If an array really is best, it would be really nice if "sort" could pass
a 'void*' down to the cmp function, and it could sort all bios that are
*after* last_bio_pos first, and then the others.  That would make the
code much simpler.  I guess sort() could be changed (list_sort() already
has a 'priv' argument like this).

If we cannot change sort(), then maybe use lib/bsearch.c for the binary
search.  Performing two comparisons in the loop of a binary search
should get a *fail* in any algorithms class!!

The "pending_data" array that you have added to the r5conf structure
adds 4096 bytes.  This means it is larger than a page, which is best
avoided (though it is unlikely to cause problems).  I would allocate it
separately.

So there is a lot that I don't really like, but it seems like a good
idea in principle.

Thanks,
NeilBrown


>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5.c | 141 ++++++++++++++++++++++++++++++++++++++++++++---------
>  drivers/md/raid5.h |  11 ++++-
>  2 files changed, 127 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d7c90da..fe6232f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -56,6 +56,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/flex_array.h>
>  #include <trace/events/block.h>
> +#include <linux/sort.h>
>  
>  #include "md.h"
>  #include "raid5.h"
> @@ -876,41 +877,121 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
>  	return 1;
>  }
>  
> -static void flush_deferred_bios(struct r5conf *conf)
> +static void dispatch_bio_list(struct bio_list *tmp)
>  {
> -	struct bio_list tmp;
>  	struct bio *bio;
>  
> +	while ((bio = bio_list_pop(tmp)))
> +		generic_make_request(bio);
> +}
> +
> +static int cmp_sh(const void *a, const void *b)
> +{
> +	const struct r5pending_data *da = a;
> +	const struct r5pending_data *db = b;
> +
> +	if (da->sector > db->sector)
> +		return 1;
> +	if (da->sector < db->sector)
> +		return -1;
> +	return 0;
> +}
> +
> +static void dispatch_defer_bios(struct r5conf *conf, int target,
> +				struct bio_list *list)
> +{
> +	int start = 0;
> +	int end = conf->pending_data_cnt - 1;
> +	int mid;
> +	int cnt = 0;
> +
> +	if (conf->pending_data_cnt == 0)
> +		return;
> +	sort(conf->pending_data, conf->pending_data_cnt,
> +		sizeof(struct r5pending_data), cmp_sh, NULL);
> +
> +	while (start <= end) {
> +		mid = (start + end) / 2;
> +		if (conf->pending_data[mid].sector > conf->last_bio_pos)
> +			end = mid - 1;
> +		else if (conf->pending_data[mid].sector < conf->last_bio_pos)
> +			start = mid + 1;
> +		else {
> +			start = mid;
> +			break;
> +		}
> +	}
> +
> +	end = conf->pending_data_cnt - 1;
> +	for (mid = start; mid <= end; mid++) {
> +		conf->last_bio_pos = conf->pending_data[mid].sector;
> +		bio_list_merge(list, &conf->pending_data[mid].bios);
> +
> +		cnt++;
> +		if (cnt >= target) {
> +			mid++;
> +			break;
> +		}
> +	}
> +	conf->pending_data_cnt -= cnt;
> +	BUG_ON(conf->pending_data_cnt < 0);
> +	if (mid <= end) {
> +		memmove(&conf->pending_data[start], &conf->pending_data[mid],
> +			(end - mid + 1) * sizeof(struct r5pending_data));
> +		return;
> +	}
> +
> +	for (mid = 0; mid < start; mid++) {
> +		conf->last_bio_pos = conf->pending_data[mid].sector;
> +		bio_list_merge(list, &conf->pending_data[mid].bios);
> +
> +		cnt++;
> +		if (cnt >= target) {
> +			mid++;
> +			break;
> +		}
> +	}
> +
> +	conf->pending_data_cnt -= mid;
> +	BUG_ON(conf->pending_data_cnt < 0);
> +	if (mid == start) {
> +		BUG_ON(conf->pending_data_cnt);
> +		return;
> +	}
> +	memmove(&conf->pending_data[0], &conf->pending_data[mid],
> +			(start - mid) * sizeof(struct r5pending_data));
> +}
> +
> +static void flush_deferred_bios(struct r5conf *conf)
> +{
> +	struct bio_list tmp = BIO_EMPTY_LIST;
> +
>  	if (!conf->batch_bio_dispatch || !conf->group_cnt)
>  		return;
>  
> -	bio_list_init(&tmp);
>  	spin_lock(&conf->pending_bios_lock);
> -	bio_list_merge(&tmp, &conf->pending_bios);
> -	bio_list_init(&conf->pending_bios);
> +	dispatch_defer_bios(conf, conf->pending_data_cnt, &tmp);
>  	spin_unlock(&conf->pending_bios_lock);
>  
> -	while ((bio = bio_list_pop(&tmp)))
> -		generic_make_request(bio);
> +	dispatch_bio_list(&tmp);
>  }
>  
> -static void defer_bio_issue(struct r5conf *conf, struct bio *bio)
> +static void defer_issue_bios(struct r5conf *conf, sector_t sector,
> +				struct bio_list *bios)
>  {
> -	/*
> -	 * change group_cnt will drain all bios, so this is safe
> -	 *
> -	 * A read generally means a read-modify-write, which usually means a
> -	 * randwrite, so we don't delay it
> -	 */
> -	if (!conf->batch_bio_dispatch || !conf->group_cnt ||
> -	    bio_op(bio) == REQ_OP_READ) {
> -		generic_make_request(bio);
> -		return;
> -	}
> +	struct bio_list tmp = BIO_EMPTY_LIST;
> +
>  	spin_lock(&conf->pending_bios_lock);
> -	bio_list_add(&conf->pending_bios, bio);
> +	conf->pending_data[conf->pending_data_cnt].sector = sector;
> +	bio_list_init(&conf->pending_data[conf->pending_data_cnt].bios);
> +	bio_list_merge(&conf->pending_data[conf->pending_data_cnt].bios,
> +		bios);
> +	conf->pending_data_cnt++;
> +	if (conf->pending_data_cnt >= PENDING_IO_MAX)
> +		dispatch_defer_bios(conf, PENDING_IO_ONE_FLUSH, &tmp);
>  	spin_unlock(&conf->pending_bios_lock);
> -	md_wakeup_thread(conf->mddev->thread);
> +
> +	dispatch_bio_list(&tmp);
>  }
>  
>  static void
> @@ -923,6 +1004,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  	struct r5conf *conf = sh->raid_conf;
>  	int i, disks = sh->disks;
>  	struct stripe_head *head_sh = sh;
> +	struct bio_list pending_bios = BIO_EMPTY_LIST;
> +	bool should_defer;
>  
>  	might_sleep();
>  
> @@ -939,6 +1022,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  		}
>  	}
>  
> +	should_defer = conf->batch_bio_dispatch && conf->group_cnt;
> +
>  	for (i = disks; i--; ) {
>  		int op, op_flags = 0;
>  		int replace_only = 0;
> @@ -1093,7 +1178,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  				trace_block_bio_remap(bdev_get_queue(bi->bi_bdev),
>  						      bi, disk_devt(conf->mddev->gendisk),
>  						      sh->dev[i].sector);
> -			defer_bio_issue(conf, bi);
> +			if (should_defer && op_is_write(op))
> +				bio_list_add(&pending_bios, bi);
> +			else
> +				generic_make_request(bi);
>  		}
>  		if (rrdev) {
>  			if (s->syncing || s->expanding || s->expanded
> @@ -1138,7 +1226,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  				trace_block_bio_remap(bdev_get_queue(rbi->bi_bdev),
>  						      rbi, disk_devt(conf->mddev->gendisk),
>  						      sh->dev[i].sector);
> -			defer_bio_issue(conf, rbi);
> +			if (should_defer && op_is_write(op))
> +				bio_list_add(&pending_bios, rbi);
> +			else
> +				generic_make_request(rbi);
>  		}
>  		if (!rdev && !rrdev) {
>  			if (op_is_write(op))
> @@ -1156,6 +1247,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  		if (sh != head_sh)
>  			goto again;
>  	}
> +
> +	if (should_defer && !bio_list_empty(&pending_bios))
> +		defer_issue_bios(conf, head_sh->sector, &pending_bios);
>  }
>  
>  static struct dma_async_tx_descriptor *
> @@ -6808,7 +6902,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>  	atomic_set(&conf->active_stripes, 0);
>  	atomic_set(&conf->preread_active_stripes, 0);
>  	atomic_set(&conf->active_aligned_reads, 0);
> -	bio_list_init(&conf->pending_bios);
>  	spin_lock_init(&conf->pending_bios_lock);
>  	conf->batch_bio_dispatch = true;
>  	rdev_for_each(rdev, mddev) {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 6b9d2e8..eb6d5d5 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -572,6 +572,13 @@ enum r5_cache_state {
>  				 */
>  };
>  
> +#define PENDING_IO_MAX 512
> +#define PENDING_IO_ONE_FLUSH 128
> +struct r5pending_data {
> +	sector_t sector; /* stripe sector */
> +	struct bio_list bios;
> +};
> +
>  struct r5conf {
>  	struct hlist_head	*stripe_hashtbl;
>  	/* only protect corresponding hash list and inactive_list */
> @@ -689,9 +696,11 @@ struct r5conf {
>  	int			worker_cnt_per_group;
>  	struct r5l_log		*log;
>  
> -	struct bio_list		pending_bios;
>  	spinlock_t		pending_bios_lock;
>  	bool			batch_bio_dispatch;
> +	struct r5pending_data	pending_data[PENDING_IO_MAX];
> +	int			pending_data_cnt;
> +	sector_t		last_bio_pos;
>  };
>  
>  
> -- 
> 2.9.3

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
From: NeilBrown @ 2017-03-03  5:20 UTC (permalink / raw)
  To: Guoqing Jiang, Shaohua Li; +Cc: linux-raid, shli
In-Reply-To: <58B8DE44.7040804@suse.com>

[-- Attachment #1: Type: text/plain, Size: 5863 bytes --]

On Fri, Mar 03 2017, Guoqing Jiang wrote:

> On 03/03/2017 06:15 AM, NeilBrown wrote:
>> On Thu, Mar 02 2017, Shaohua Li wrote:
>>
>>> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>>>> Since we have switched to sync way to handle METADATA_UPDATED
>>>> msg for md-cluster, then process_metadata_update is depended
>>>> on mddev->thread->wqueue.
>>>>
>>>> With the new change, clustered raid could possible hang if
>>>> array received a METADATA_UPDATED msg after array unregistered
>>>> mddev->thread, so we need to stop clustered raid earlier
>>>> than before.
>>>>
>>>> And this change should be safe for non-clustered raid since
>>>> all writes are stopped before the destroy. Also in md_run,
>>>> we activate the personality (pers->run()) before activating
>>>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>>>> to stop the bitmap (bitmap_destroy()) before stopping the
>>>> personality (__md_stop() calls pers->free()).
>>>>
>>>> Reviewed-by: NeilBrown <neilb@suse.com>
>>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>   drivers/md/md.c | 30 +++++++++++++++++-------------
>>>>   1 file changed, 17 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 44206bc6e3aa..e1d9116044ae 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>>>   	/* stop the array and free an attached data structures.
>>>>   	 * This is called from dm-raid
>>>>   	 */
>>>> -	__md_stop(mddev);
>>>>   	bitmap_destroy(mddev);
>>>> +	__md_stop(mddev);
>>>>   	if (mddev->bio_set)
>>>>   		bioset_free(mddev->bio_set);
>>>>   }
>>> Applied other 4 patches. But this one I still have concerns.
>>>
>>> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
>>> wait behind IO completion. So even there are no writes running, there might be
>>> behind IO running. mddev_detach will do the wait which checks bitmap. If we
>>> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.
>
> Got it, thanks for explanation!
>
>>>
>>> Probably we should move mddev_detach out of __md_stop and always do:
>>> mddev_detach()
>>> bitmap_destroy()
>>> __md_stop()
>>>
>>> This looks safer to me.
>> Thanks for catching that.  I agree - mddev_detach should come before
>> bitmap_destroy.
>> I might be best to change __md_stop() to start
>>
>> static void __md_stop(struct mddev *mddev)
>> {
>> 	struct md_personality *pers = mddev->pers;
>> 	mddev_detach(mddev);
>> +	bitmap_destroy(mddev);
>> 	/* Ensure ->event_work is done */
>> 	flush_workqueue(md_misc_wq);
>
> With this change, does it mean we destroy bitmap unconditionally even
> if the mode is 2? Thanks.

Yes, and we should destroy the bitmap.  In that case we need to return
the array to the start it was before RUN_ARRAY ioctl.

>
>> That would make the remainder of the patch (below) unnecessary.
>> I think it is wrong anyway.
>> We were correct to move the "bitmap_destroy() call up to the
>> "if (mddev->pers)" case, but we were not correct to move the
>> closing of bitmap_info.file.
>> If a file is added to an array but that array is not started
>> (so ->pers is not set), then stopping the array will not close the file
>> with the change below, and that isn't good.
>
> Hmm, thanks for reminder. Though I have some questions about above.
> I guess the file is pointed to bitmap file, from mdadm manpage, I see
>   "-b, --bitmap=" is used under create, build, grow and assemble mode.
> But  how to add a file to a array which is not started? Please correct me
> for my wrong understanding.

Call SET_BITMAP_FILE ioctl.
mdadm won't do this without then calling RUN_ARRAY (or similar).
But is it is possible and we should handle it.
o

>
>> So if we move bitmap_destroy() into  __md_stop() and remove it from
>> do_md_stop and md_stop(), that might be exactly what we need.
>
> How about the below changes? It may addresses both your concern and
> Shaohua's concern, but not sure ...


No, that is more complex than needed, and doesn't call bitmap_destroy()
always when required.

Thanks,
NeilBrown


>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 79a99a1..a0e79d6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5555,7 +5555,6 @@ static void mddev_detach(struct mddev *mddev)
>   static void __md_stop(struct mddev *mddev)
>   {
>          struct md_personality *pers = mddev->pers;
> -       mddev_detach(mddev);
>          /* Ensure ->event_work is done */
>          flush_workqueue(md_misc_wq);
>          spin_lock(&mddev->lock);
> @@ -5574,8 +5573,9 @@ void md_stop(struct mddev *mddev)
>          /* stop the array and free an attached data structures.
>           * This is called from dm-raid
>           */
> -       __md_stop(mddev);
> +       mddev_detach(mddev);
>          bitmap_destroy(mddev);
> +       __md_stop(mddev);
>          if (mddev->bio_set)
>                  bioset_free(mddev->bio_set);
>   }
> @@ -5688,6 +5688,10 @@ static int do_md_stop(struct mddev *mddev, int mode,
>                          set_disk_ro(disk, 0);
>
>                  __md_stop_writes(mddev);
> +               mddev_detach(mddev);
> +               if (mode == 0)
> +                       bitmap_destroy(mddev);
> +
>                  __md_stop(mddev);
> mddev->queue->backing_dev_info->congested_fn = NULL;
>
> @@ -5713,7 +5717,8 @@ static int do_md_stop(struct mddev *mddev, int mode,
>          if (mode == 0) {
>                  pr_info("md: %s stopped.\n", mdname(mddev));
>
> -               bitmap_destroy(mddev);
> +               if (!mddev->pers)
> +                       bitmap_destroy(mddev);
>                  if (mddev->bitmap_info.file) {
>                          struct file *f = mddev->bitmap_info.file;
>                          spin_lock(&mddev->lock);
>
> Thanks,
> Guoqing

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-03  6:22 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVP1gNEsUFdh1eKW9d01azis2+GMNO_kZjN8_ZQEbe8BxQ@mail.gmail.com>

On Fri, Mar 3, 2017 at 10:20 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li <shli@kernel.org> wrote:
>> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>>> Hi Shaohua,
>>>
>>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
>>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>>> >> Use this helper, instead of direct access to .bi_vcnt.
>>> >
>>> > what We really need to do for the behind IO is:
>>> > - allocate memory and copy bio data to the memory
>>> > - let behind bio do IO against the memory
>>> >
>>> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
>>> > just track the new memory, we don't need use the bio_segments_all and access
>>> > bio_vec too.
>>>
>>> But we need to figure out how many vecs(each vec store one page) to be
>>> allocated for the cloned/behind bio, and that is the only value of
>>> bio_segments_all() here. Or you have idea to avoid that?
>>
>> As I said, the behind bio doesn't need to have the exactly same bio_vec
>> setting. We just allocate memory and copy original bio data to the memory,
>> then do IO against the new memory. The behind bio
>> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT
>
> The equation isn't always correct, especially when bvec includes just
> part of page, and it is quite often in case of mkfs, in which one bvec often
> includes 512byte buffer.

Think it further, your idea could be workable and more clean, but the change
can be a bit big, looks we need to switch handling write behind into
the following way:

1) replace bio_clone_bioset_partial() with bio_allocate(nr_vecs), and 'nr_vecs'
is computed with your equation;

2) allocate 'nr_vecs' pages once and share them among all created bio in 1)

3) for each created bio, add each page into the bio via bio_add_page()

4) only for the 1st created bio, call bio_copy_data() to copy data from
master bio.

Let me know if you are OK with the above implementaion.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH RFC] test: revise 'test' and make it easier to understand
From: Guoqing Jiang @ 2017-03-03  8:06 UTC (permalink / raw)
  To: Zhilong Liu, neilb, Jes.Sorensen; +Cc: linux-raid
In-Reply-To: <1488250053-10436-1-git-send-email-zlliu@suse.com>



On 02/28/2017 10:47 AM, Zhilong Liu wrote:
> 1. use 'Tab' as the code style.
> 2. arrange the testing steps and provide the 'main' entrance.
> 3. draft the log_save feature, it captures the /proc/mdstat,
>     md superblock info, bitmap info and the detail dmesg.
> 4. modified the mdadm() func, adding the operation that clear
>     the superblock when create or build one new array, and it
>     would exit testing when mdadm command returned non-0 value.
> 5. delete no_errors() func, it only used in tests/04update-uuid,
>     I recommend the new mdadm() using method.
> 6. delete fast_sync() func.
> 7. testdev(), add the object file checking, otherwise this command
>     would create one regular file, it's one trouble thing.
> 8. add dmesg checking in do_test() func, it's necessary to check
>     dmesg whether or not printed abnormal message.
> 9. add checking conditions in main(), such as $pwd/raid6check need
>     exists, here is a prompt to remind users to 'make everything'
>     before testing; the $targetdir should mount under ext[2-4] FS,
>     because the external bitmap only supports ext, the bmap() API
>     of bitmap.c doesn't exist in all filesystem, such as btrfs.
>

I like the improvement for the test, and I would suggest you split
those changes into smaller patches, make each patch do one thing,
it would be easier for Jes to review I think, and you still can merge
them into one finally if Jes prefer one patch with huge changes, :-) .

Cheers,
Guoqing

^ permalink raw reply

* Re: [mdadm] compiler warning
From: Michael Shigorin @ 2017-03-03  8:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87bmtjnt5v.fsf@notabene.neil.brown.name>

On Fri, Mar 03, 2017 at 10:41:16AM +1100, NeilBrown wrote:
> > Thought I'd better write this. :)
> Thanks for the report, though we prefer such things to be sent to
>   linux-raid@vger.kernel.org

Ack!

> The warning is just telling us that the compiler is smart
> enough to deduce that whenever st is uninitialised, err = 0.
> So the code is safe, but maybe more complex than needed.
> I'll submit a patch to improve it a little.

Thanks for explanation, I'll file a bug against the compiler then.
Your commit has cleared the warning in the mean time.

-- 
 ---- WBR, Michael Shigorin / http://altlinux.org
  ------ http://opennet.ru / http://anna-news.info

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox