* Re: [PATCH] md/r5cache: flush data in memory during journal device failure
From: Shaohua Li @ 2017-03-15 22:48 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team,
dan.j.williams@intel.com, hch@infradead.org
In-Reply-To: <5FFE3F62-D87A-46C5-B9D7-7A7501A32B90@fb.com>
On Tue, Mar 14, 2017 at 10:40:14PM +0000, Song Liu wrote:
>
> > On Mar 14, 2017, at 10:50 AM, Shaohua Li <shli@kernel.org> wrote:
> >
> > On Mon, Mar 13, 2017 at 04:36:26PM -0700, Song Liu wrote:
> >> For the raid456 with writeback cache, when journal device failed during
> >> normal operation, it is still possible to persist all data, as all
> >> pending data is still in stripe cache. However, the stripe will be
> >> marked as fail with s.log_failed. Thus, the write out from stripe cache
> >> cannot make progress.
> >>
> >> To unblock the write out in journal failures, this patch allows stripes
> >> with data injournal to make progress.
> >
> > what about the parity part? if log failed, we should skip journaling the parity.
> >
> > Thanks,
> > Shaohua
> >
>
> For stripes with data in journal (not flushed yet), the state machine
> can flush them out. The behavior is just like when there are no journal
> at all.
can you explain this more? I didn't find any place we check the failure bit and
so skip journaling the parity. Also include the description in the changelog.
> On the other hand, other writes will be gated by the log_failed flags,
> so the array appears to be read-only to upper layers.
>
> Thanks,
> Song
>
> >> The array should be read-only in journal failures. Therefore, pending
> >> writes (in dev->towrite) are excluded in this write (in delay_towrite).
> >>
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >> drivers/md/raid5.c | 10 +++++++++-
> >> 1 file changed, 9 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index 3233975..447d9dd 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -3069,6 +3069,10 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
> >> * When LOG_CRITICAL, stripes with injournal == 0 will be sent to
> >> * no_space_stripes list.
> >> *
> >> + * 3. during journal failure
> >> + * In journal failure, we try to flush all cached data to raid disks
> >> + * based on data in stripe cache. The array is read-only to upper
> >> + * layers, so we would skip all pending writes.
> >> */
> >> static inline bool delay_towrite(struct r5conf *conf,
> >> struct r5dev *dev,
> >> @@ -3082,6 +3086,9 @@ static inline bool delay_towrite(struct r5conf *conf,
> >> if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
> >> s->injournal > 0)
> >> return true;
> >> + /* case 3 above */
> >> + if (s->log_failed && s->injournal)
> >> + return true;
> >> return false;
> >> }
> >>
> >> @@ -4721,7 +4728,8 @@ static void handle_stripe(struct stripe_head *sh)
> >> /* check if the array has lost more than max_degraded devices and,
> >> * if so, some requests might need to be failed.
> >> */
> >> - if (s.failed > conf->max_degraded || s.log_failed) {
> >> + if (s.failed > conf->max_degraded ||
> >> + (s.log_failed && s.injournal == 0)) {
> >> sh->check_state = 0;
> >> sh->reconstruct_state = 0;
> >> break_stripe_batch_list(sh, 0);
> >> --
> >> 2.9.3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [PATCH 1/2] md/r5cache: improve add-journal
From: Shaohua Li @ 2017-03-15 22:51 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
jes.sorensen
In-Reply-To: <20170315182816.3627123-1-songliubraving@fb.com>
On Wed, Mar 15, 2017 at 11:28:15AM -0700, Song Liu wrote:
> 1. suspend the array before adding journal, so that we can add journal
> when the array is not read-only;
we can't call mddev_suspend in raid5d, because there is deadlock.
raid5_add_disk can be called in raid5d.
> 2. allow recreate journal when existing journal is Faulty. So that we can
> add-journal before removing failed journal.
this is weird usage, why don't we remove the failed journal first?
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/md.c | 5 +++--
> drivers/md/raid5.c | 6 ++----
> 2 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 42e68b2..ac3bd15 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6230,9 +6230,10 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
> struct md_rdev *rdev2;
> bool has_journal = false;
>
> - /* make sure no existing journal disk */
> + /* make sure no active journal disk */
> rdev_for_each(rdev2, mddev) {
> - if (test_bit(Journal, &rdev2->flags)) {
> + if (test_bit(Journal, &rdev2->flags) &&
> + !test_bit(Faulty, &rdev2->flags)) {
> has_journal = true;
> break;
> }
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 447d9dd..ee8648b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7758,11 +7758,9 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
> return -EBUSY;
>
> rdev->raid_disk = 0;
> - /*
> - * The array is in readonly mode if journal is missing, so no
> - * write requests running. We should be safe
> - */
> + mddev_suspend(mddev);
> log_init(conf, rdev);
> + mddev_resume(mddev);
> return 0;
> }
> if (mddev->recovery_disabled == conf->recovery_disabled)
> --
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/2] md/r5cache: journal remove support
From: Shaohua Li @ 2017-03-15 22:52 UTC (permalink / raw)
To: Song Liu
Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch,
jes.sorensen
In-Reply-To: <20170315182816.3627123-2-songliubraving@fb.com>
On Wed, Mar 15, 2017 at 11:28:16AM -0700, Song Liu wrote:
> When journal device of an array fails, the array is forced into read-only
> mode. To make the array normal without adding another journal device, we
> need to remove journal _feature_ from the array.
>
> This patch allows remove journal _feature_ from an array, For journal
> remove to work, existing journal should be either missing or faulty.
>
> Two flags are added to GET_ARRAY_INFO for mdadm.
> 1. MD_SB_HAS_JOURNAL: meaning the array have journal feature;
> 2. MD_SB_JOURNAL_REMOVABLE: meaning the journal is faulty or missing
>
> When both flags are set, mdadm can clear MD_SB_HAS_JOURNAL to remove
> journal _feature_.
please use the new 'consistency_policy' interface to remove journal support
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> drivers/md/md.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/raid/md_p.h | 11 ++++++++---
> 2 files changed, 48 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac3bd15..32ee994 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5981,6 +5981,25 @@ static void autorun_devices(int part)
> }
> #endif /* !MODULE */
>
> +/*
> + * the journal _feature_ is removable when:
> + * the array has journal support &&
> + * (journal is missing || journal is faulty)
> + */
> +static bool journal_removable(struct mddev *mddev)
> +{
> + struct md_rdev *rdev;
> +
> + if (!test_bit(MD_HAS_JOURNAL, &mddev->flags))
> + return false;
> +
> + rdev_for_each_rcu(rdev, mddev)
> + if (test_bit(Journal, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags))
> + return false;
> + return true;
> +}
> +
> static int get_version(void __user *arg)
> {
> mdu_version_t ver;
> @@ -6041,6 +6060,10 @@ static int get_array_info(struct mddev *mddev, void __user *arg)
> info.state |= (1<<MD_SB_BITMAP_PRESENT);
> if (mddev_is_clustered(mddev))
> info.state |= (1<<MD_SB_CLUSTERED);
> + if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
> + info.state |= (1<<MD_SB_HAS_JOURNAL);
> + if (journal_removable(mddev))
> + info.state |= (1<<MD_SB_JOURNAL_REMOVABLE);
> info.active_disks = insync;
> info.working_disks = working;
> info.failed_disks = failed;
> @@ -6721,6 +6744,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
> /* calculate expected state,ignoring low bits */
> if (mddev->bitmap && mddev->bitmap_info.offset)
> state |= (1 << MD_SB_BITMAP_PRESENT);
> + if (journal_removable(mddev))
> + state |= (1 << MD_SB_JOURNAL_REMOVABLE);
>
> if (mddev->major_version != info->major_version ||
> mddev->minor_version != info->minor_version ||
> @@ -6730,8 +6755,11 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
> /* mddev->layout != info->layout || */
> mddev->persistent != !info->not_persistent ||
> mddev->chunk_sectors != info->chunk_size >> 9 ||
> - /* ignore bottom 8 bits of state, and allow SB_BITMAP_PRESENT to change */
> - ((state^info->state) & 0xfffffe00)
> + /*
> + * ignore bottom 8 bits of state, and allow SB_BITMAP_PRESENT
> + * and SB_HAS_JOURNAL to change
> + */
> + ((state^info->state) & 0xfffffc00)
> )
> return -EINVAL;
> /* Check there is only one change */
> @@ -6743,6 +6771,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
> cnt++;
> if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT))
> cnt++;
> + if ((state ^ info->state) & (1<<MD_SB_HAS_JOURNAL))
> + cnt++;
> if (cnt == 0)
> return 0;
> if (cnt > 1)
> @@ -6831,6 +6861,14 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
> mddev->bitmap_info.offset = 0;
> }
> }
> + if ((state ^ info->state) & (1<<MD_SB_HAS_JOURNAL)) {
> + if (!journal_removable(mddev)) {
> + rv = -EINVAL;
> + goto err;
> + }
> + clear_bit(MD_HAS_JOURNAL, &mddev->flags);
> + }
> +
> md_update_sb(mddev, 1);
> return rv;
> err:
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index d9a1ead..b1f2b63 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -1,15 +1,15 @@
> /*
> md_p.h : physical layout of Linux RAID devices
> Copyright (C) 1996-98 Ingo Molnar, Gadi Oxman
> -
> +
> This program is free software; you can redistribute it and/or modify
> it under the terms of the GNU General Public License as published by
> the Free Software Foundation; either version 2, or (at your option)
> any later version.
> -
> +
> You should have received a copy of the GNU General Public License
> (for example /usr/src/linux/COPYING); if not, write to the Free
> - Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> */
>
> #ifndef _MD_P_H
> @@ -119,6 +119,11 @@ typedef struct mdp_device_descriptor_s {
>
> #define MD_SB_CLUSTERED 5 /* MD is clustered */
> #define MD_SB_BITMAP_PRESENT 8 /* bitmap may be present nearby */
> +#define MD_SB_HAS_JOURNAL 9
> +#define MD_SB_JOURNAL_REMOVABLE 10 /* journal _feature_ can be removed,
> + * which means the journal is either
> + * missing or Faulty
> + */
>
> /*
> * Notes:
> --
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: Shaohua Li @ 2017-03-15 23:03 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148954711228.18641.2048575896322496918.stgit@noble>
On Wed, Mar 15, 2017 at 02:05:12PM +1100, Neil Brown wrote:
> If a device fails during a write, we must ensure the failure is
> recorded in the metadata before the completion of the write is
> acknowleged.
>
> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
> write request returns.") added code for this, but it was
> unnecessarily complicated. We already had similar functionality for
> handling updates to the bad-block-list, thanks to Commit de393cdea66c
> ("md: make it easier to wait for bad blocks to be acknowledged.")
>
> So revert most of the former commit, and instead avoid collecting
> completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush
> the metadata and retry the stripe_head.
> As this change can leave a stripe_head ready for handling immediately
> after handle_active_stripes() returns, we change raid5_do_work() to
> pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
>
> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
> asynchronously. After analyse_stripe(), we have collected stable data
> about the state of devices, which will be used to make decisions.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> drivers/md/raid5.c | 31 ++++++++-----------------------
> drivers/md/raid5.h | 3 ---
> 2 files changed, 8 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cc2d039b4aae..f990f74901d2 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4690,7 +4690,8 @@ static void handle_stripe(struct stripe_head *sh)
> if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
> goto finish;
>
> - if (s.handle_bad_blocks) {
> + if (s.handle_bad_blocks ||
> + test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
> set_bit(STRIPE_HANDLE, &sh->state);
> goto finish;
> }
> @@ -5020,15 +5021,8 @@ static void handle_stripe(struct stripe_head *sh)
> md_wakeup_thread(conf->mddev->thread);
> }
>
> - if (!bio_list_empty(&s.return_bi)) {
> - if (test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
> - spin_lock_irq(&conf->device_lock);
> - bio_list_merge(&conf->return_bi, &s.return_bi);
> - spin_unlock_irq(&conf->device_lock);
> - md_wakeup_thread(conf->mddev->thread);
> - } else
> - return_io(&s.return_bi);
> - }
> + if (!bio_list_empty(&s.return_bi))
> + return_io(&s.return_bi);
>
> clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
> }
> @@ -6225,6 +6219,7 @@ static void raid5_do_work(struct work_struct *work)
> struct r5worker *worker = container_of(work, struct r5worker, work);
> struct r5worker_group *group = worker->group;
> struct r5conf *conf = group->conf;
> + struct mddev *mddev = conf->mddev;
> int group_id = group - conf->worker_groups;
> int handled;
> struct blk_plug plug;
> @@ -6245,6 +6240,9 @@ static void raid5_do_work(struct work_struct *work)
> if (!batch_size && !released)
> break;
> handled += batch_size;
> + wait_event_lock_irq(mddev->sb_wait,
> + !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags),
MD_SB_CHANGE_PENDING?
> + conf->device_lock);
> }
> pr_debug("%d stripes handled\n", handled);
>
> @@ -6272,18 +6270,6 @@ static void raid5d(struct md_thread *thread)
>
> md_check_recovery(mddev);
>
> - if (!bio_list_empty(&conf->return_bi) &&
> - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> - struct bio_list tmp = BIO_EMPTY_LIST;
> - spin_lock_irq(&conf->device_lock);
> - if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> - bio_list_merge(&tmp, &conf->return_bi);
> - bio_list_init(&conf->return_bi);
> - }
> - spin_unlock_irq(&conf->device_lock);
> - return_io(&tmp);
> - }
> -
> blk_start_plug(&plug);
> handled = 0;
> spin_lock_irq(&conf->device_lock);
> @@ -6935,7 +6921,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
> INIT_LIST_HEAD(&conf->hold_list);
> INIT_LIST_HEAD(&conf->delayed_list);
> INIT_LIST_HEAD(&conf->bitmap_list);
> - bio_list_init(&conf->return_bi);
> init_llist_head(&conf->released_stripes);
> atomic_set(&conf->active_stripes, 0);
> atomic_set(&conf->preread_active_stripes, 0);
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index ba5b7a3790af..13800dc9dd88 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -638,9 +638,6 @@ struct r5conf {
> int skip_copy; /* Don't copy data from bio to stripe cache */
> struct list_head *last_hold; /* detect hold_list promotions */
>
> - /* bios to have bi_end_io called after metadata is synced */
> - struct bio_list return_bi;
> -
> atomic_t reshape_stripes; /* stripes with pending writes for reshape */
> /* unfortunately we need two cache names as we temporarily have
> * two caches.
>
>
^ permalink raw reply
* Re: [PATCH] md/r5cache: flush data in memory during journal device failure
From: Song Liu @ 2017-03-15 23:45 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team, Dan Williams,
hch@infradead.org
In-Reply-To: <20170314175048.rjyaufwgaclsmhdz@kernel.org>
> On Mar 14, 2017, at 10:50 AM, Shaohua Li <shli@kernel.org> wrote:
>
> On Mon, Mar 13, 2017 at 04:36:26PM -0700, Song Liu wrote:
>> For the raid456 with writeback cache, when journal device failed during
>> normal operation, it is still possible to persist all data, as all
>> pending data is still in stripe cache. However, the stripe will be
>> marked as fail with s.log_failed. Thus, the write out from stripe cache
>> cannot make progress.
>>
>> To unblock the write out in journal failures, this patch allows stripes
>> with data injournal to make progress.
>
> what about the parity part? if log failed, we should skip journaling the parity.
>
Hmm.. I guess we need to check Faulty bit in some functions. Simply checking
log is not NULL is not enough.
I will update the patch.
Thanks,
Song
^ permalink raw reply
* Re: [PATCH 1/2] md/r5cache: improve add-journal
From: Song Liu @ 2017-03-15 23:48 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-raid, Shaohua Li, NeilBrown, Kernel Team, Dan Williams,
Christoph Hellwig, jes.sorensen@gmail.com
In-Reply-To: <20170315225100.mr4glirjdjx237fq@kernel.org>
> On Mar 15, 2017, at 3:51 PM, Shaohua Li <shli@kernel.org> wrote:
>
> On Wed, Mar 15, 2017 at 11:28:15AM -0700, Song Liu wrote:
>> 1. suspend the array before adding journal, so that we can add journal
>> when the array is not read-only;
>
> we can't call mddev_suspend in raid5d, because there is deadlock.
> raid5_add_disk can be called in raid5d.
I see. I guess we can just require setting read only before adding journal.
>
>> 2. allow recreate journal when existing journal is Faulty. So that we can
>> add-journal before removing failed journal.
>
> this is weird usage, why don't we remove the failed journal first?
This is really not necessary.
I guess we can just drop this patch.
Thanks,
Song
^ permalink raw reply
* Re: [md PATCH 10/15] md/raid1: stop using bi_phys_segment
From: Shaohua Li @ 2017-03-16 0:13 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148954711389.18641.6044680366998154084.stgit@noble>
On Wed, Mar 15, 2017 at 02:05:14PM +1100, Neil Brown wrote:
> Change to use bio->__bi_remaining to count number of r1bio attached
> to a bio.
> See precious raid10 patch for more details.
>
> Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
> used to measure different things, but were being compared.
>
> This patch fixes another bug in that nr_pending previously did not
> could write-behind requests, so behind writes could continue while
> resync was happening. How that nr_pending counts all r1_bio,
> the resync cannot commence until the behind writes have completed.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> drivers/md/raid1.c | 87 +++++++++++++---------------------------------------
> 1 file changed, 22 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7e509a894f15..e566407b196f 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -246,35 +246,18 @@ static void reschedule_retry(struct r1bio *r1_bio)
> static void call_bio_endio(struct r1bio *r1_bio)
> {
> struct bio *bio = r1_bio->master_bio;
> - int done;
> struct r1conf *conf = r1_bio->mddev->private;
> sector_t bi_sector = bio->bi_iter.bi_sector;
>
> - if (bio->bi_phys_segments) {
> - unsigned long flags;
> - spin_lock_irqsave(&conf->device_lock, flags);
> - bio->bi_phys_segments--;
> - done = (bio->bi_phys_segments == 0);
> - spin_unlock_irqrestore(&conf->device_lock, flags);
> - /*
> - * make_request() might be waiting for
> - * bi_phys_segments to decrease
> - */
> - wake_up(&conf->wait_barrier);
> - } else
> - done = 1;
> -
> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> bio->bi_error = -EIO;
>
> - if (done) {
> - bio_endio(bio);
> - /*
> - * Wake up any possible resync thread that waits for the device
> - * to go idle.
> - */
> - allow_barrier(conf, bi_sector);
> - }
> + bio_endio(bio);
> + /*
> + * Wake up any possible resync thread that waits for the device
> + * to go idle.
> + */
> + allow_barrier(conf, bi_sector);
I think this one should be r1_bio->sector instead of master_bio->sector,
because multiple r1_bio could be attached to a master_bio. Maybe not change
anything, because both sector should be in the same barrier unit, but we'd
better to be consistent.
Thanks,
Shaohua
^ permalink raw reply
* Re: [md PATCH 15/15] MD: use per-cpu counter for writes_pending
From: Shaohua Li @ 2017-03-16 1:05 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148954711465.18641.8222940807591984069.stgit@noble>
On Wed, Mar 15, 2017 at 02:05:14PM +1100, Neil Brown wrote:
> The 'writes_pending' counter is used to determine when the
> array is stable so that it can be marked in the superblock
> as "Clean". Consequently it needs to be updated frequently
> but only checked for zero occasionally. Recent changes to
> raid5 cause the count to be updated even more often - once
> per 4K rather than once per bio. This provided
> justification for making the updates more efficient.
>
> So we replace the atomic counter a percpu-refcount.
> This can be incremented and decremented cheaply most of the
> time, and can be switched to "atomic" mode when more
> precise counting is needed. As it is possible for multiple
> threads to want a precise count, we introduce a
> "sync_checker" counter to count the number of threads
> in "set_in_sync()", and only switch the refcount back
> to percpu mode when that is zero.
>
> We need to be careful about races between set_in_sync()
> setting ->in_sync to 1, and md_write_start() setting it
> to zero. md_write_start() holds the rcu_read_lock()
> while checking if the refcount is in percpu mode. If
> it is, then we know a switch to 'atomic' will not happen until
> after we call rcu_read_unlock(), in which case set_in_sync()
> will see the elevated count, and not set in_sync to 1.
> If it is not in percpu mode, we take the mddev->lock to
> ensure proper synchronization.
>
> It is no longer possible to quickly check if the count is zero, which
> we previously did to update a timer or to schedule the md_thread.
> So now we do these every time we decrement that counter, but make
> sure they are fast.
>
> mod_timer() already optimizes the case where the timeout value doesn't
> actually change. We leverage that further by always rounding off the
> jiffies to the timeout value. This may delay the marking of 'clean'
> slightly, but ensure we only perform atomic operation here when absolutely
> needed.
>
> md_wakeup_thread() current always calls wake_up(), even if
> THREAD_WAKEUP is already set. That too can be optimised to avoid
> calls to wake_up().
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> drivers/md/md.c | 70 +++++++++++++++++++++++++++++++++++++------------------
> drivers/md/md.h | 3 ++
> 2 files changed, 49 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c33ec97b23d4..adf2b5bdfd67 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -65,6 +65,8 @@
> #include <linux/raid/md_p.h>
> #include <linux/raid/md_u.h>
> #include <linux/slab.h>
> +#include <linux/percpu-refcount.h>
> +
> #include <trace/events/block.h>
> #include "md.h"
> #include "bitmap.h"
> @@ -2255,16 +2257,24 @@ static void export_array(struct mddev *mddev)
> static bool set_in_sync(struct mddev *mddev)
> {
> WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
> - if (atomic_read(&mddev->writes_pending) == 0) {
> - if (mddev->in_sync == 0) {
> + if (!mddev->in_sync) {
> + mddev->sync_checkers++;
> + spin_unlock(&mddev->lock);
> + percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
> + spin_lock(&mddev->lock);
> + if (!mddev->in_sync &&
> + percpu_ref_is_zero(&mddev->writes_pending)) {
> mddev->in_sync = 1;
> + /*
> + * Ensure in_sync is visible before switch back
> + * to percpu
> + */
> smp_mb();
> - if (atomic_read(&mddev->writes_pending))
> - /* lost a race with md_write_start() */
> - mddev->in_sync = 0;
> set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
> sysfs_notify_dirent_safe(mddev->sysfs_state);
> }
> + if (--mddev->sync_checkers == 0)
> + percpu_ref_switch_to_percpu(&mddev->writes_pending);
percpu_ref_switch_to_percpu could sleep. Maybe let set_in_sync return a value
to indicate if caller should drop lock and then do the switch.
> }
> if (mddev->safemode == 1)
> mddev->safemode = 0;
> @@ -5120,6 +5130,7 @@ static void md_free(struct kobject *ko)
> del_gendisk(mddev->gendisk);
> put_disk(mddev->gendisk);
> }
> + percpu_ref_exit(&mddev->writes_pending);
>
> kfree(mddev);
> }
> @@ -5145,6 +5156,8 @@ static void mddev_delayed_delete(struct work_struct *ws)
> kobject_put(&mddev->kobj);
> }
>
> +static void no_op(struct percpu_ref *r) {}
> +
> static int md_alloc(dev_t dev, char *name)
> {
> static DEFINE_MUTEX(disks_mutex);
> @@ -5196,6 +5209,10 @@ static int md_alloc(dev_t dev, char *name)
> blk_queue_make_request(mddev->queue, md_make_request);
> blk_set_stacking_limits(&mddev->queue->limits);
>
> + if (percpu_ref_init(&mddev->writes_pending, no_op, 0, GFP_KERNEL) < 0)
> + goto abort;
> + /* We want to start with the refcount at zero */
> + percpu_ref_put(&mddev->writes_pending);
> disk = alloc_disk(1 << shift);
> if (!disk) {
> blk_cleanup_queue(mddev->queue);
> @@ -5279,11 +5296,10 @@ static void md_safemode_timeout(unsigned long data)
> {
> struct mddev *mddev = (struct mddev *) data;
>
> - if (!atomic_read(&mddev->writes_pending)) {
> - mddev->safemode = 1;
> - if (mddev->external)
> - sysfs_notify_dirent_safe(mddev->sysfs_state);
> - }
> + mddev->safemode = 1;
> + if (mddev->external)
> + sysfs_notify_dirent_safe(mddev->sysfs_state);
> +
> md_wakeup_thread(mddev->thread);
> }
>
> @@ -5488,7 +5504,6 @@ int md_run(struct mddev *mddev)
> } else if (mddev->ro == 2) /* auto-readonly not meaningful */
> mddev->ro = 0;
>
> - atomic_set(&mddev->writes_pending,0);
> atomic_set(&mddev->max_corr_read_errors,
> MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
> mddev->safemode = 0;
> @@ -7351,8 +7366,8 @@ void md_wakeup_thread(struct md_thread *thread)
> {
> if (thread) {
> pr_debug("md: waking up MD thread %s.\n", thread->tsk->comm);
> - set_bit(THREAD_WAKEUP, &thread->flags);
> - wake_up(&thread->wqueue);
> + if (!test_and_set_bit(THREAD_WAKEUP, &thread->flags))
> + wake_up(&thread->wqueue);
> }
> }
> EXPORT_SYMBOL(md_wakeup_thread);
> @@ -7886,6 +7901,7 @@ EXPORT_SYMBOL(md_done_sync);
> */
> void md_write_start(struct mddev *mddev, struct bio *bi)
> {
> + unsigned long __percpu *notused;
> int did_change = 0;
> if (bio_data_dir(bi) != WRITE)
> return;
> @@ -7899,11 +7915,12 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
> md_wakeup_thread(mddev->sync_thread);
> did_change = 1;
> }
> - atomic_inc(&mddev->writes_pending);
> + rcu_read_lock();
> + percpu_ref_get(&mddev->writes_pending);
> smp_mb(); /* Match smp_mb in set_in_sync() */
> if (mddev->safemode == 1)
> mddev->safemode = 0;
> - if (mddev->in_sync) {
> + if (mddev->in_sync || !__ref_is_percpu(&mddev->writes_pending, ¬used)) {
this need review from percpu-ref guys. The api doc declares it's internal use only.
Thanks,
Shaohua
^ permalink raw reply
* Re: [md PATCH 00/15 v2] remove all abuse of bi_phys_segments
From: Shaohua Li @ 2017-03-16 1:12 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <148954692173.18641.1294690639716682540.stgit@noble>
On Wed, Mar 15, 2017 at 02:05:11PM +1100, Neil Brown wrote:
> This is a revised version of my series to remove all abuse of
> bi_phys_segments from md/raid.
>
> Most of the effort is for raid5 - raid1 and raid10 are
> comparatively straight forward.
>
> I've also include some optimization of md_write_start() which is now
> used more heavily. This now uses percpu-refcount, which needed a
> small modification.
looks quite good, there are some minor issues I replied separately.
> As noted in some of these changelogs, a couple of bugs are fixed
> in raid1/raid10. We should probably create fixes suitable for
> -stable.
At first glance, it seems not easy to create a stable fix. The
bio_inc_remaining only exists in several kernel versions. There isn't easy way
to fix the issue without the per-bio accounting mechanism.
> In this series there is a 'block' patch and a 'percpu-refcount' patch,
> which should probably go in through other trees. I might send those
> off as appropriate soonish.
Please do. The block patch could go other tree, as it's mostly separately. The
percpu-refcount patch could be a problem if it goes other tree.
Thanks,
Shaohua
^ permalink raw reply
* Re: [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-03-16 2:45 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170315230356.3zizpl44atdikrt7@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 5099 bytes --]
On Wed, Mar 15 2017, Shaohua Li wrote:
> On Wed, Mar 15, 2017 at 02:05:12PM +1100, Neil Brown wrote:
>> If a device fails during a write, we must ensure the failure is
>> recorded in the metadata before the completion of the write is
>> acknowleged.
>>
>> Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before
>> write request returns.") added code for this, but it was
>> unnecessarily complicated. We already had similar functionality for
>> handling updates to the bad-block-list, thanks to Commit de393cdea66c
>> ("md: make it easier to wait for bad blocks to be acknowledged.")
>>
>> So revert most of the former commit, and instead avoid collecting
>> completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush
>> the metadata and retry the stripe_head.
>> As this change can leave a stripe_head ready for handling immediately
>> after handle_active_stripes() returns, we change raid5_do_work() to
>> pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
>>
>> We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
>> asynchronously. After analyse_stripe(), we have collected stable data
>> about the state of devices, which will be used to make decisions.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> drivers/md/raid5.c | 31 ++++++++-----------------------
>> drivers/md/raid5.h | 3 ---
>> 2 files changed, 8 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index cc2d039b4aae..f990f74901d2 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -4690,7 +4690,8 @@ static void handle_stripe(struct stripe_head *sh)
>> if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>> goto finish;
>>
>> - if (s.handle_bad_blocks) {
>> + if (s.handle_bad_blocks ||
>> + test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>> set_bit(STRIPE_HANDLE, &sh->state);
>> goto finish;
>> }
>> @@ -5020,15 +5021,8 @@ static void handle_stripe(struct stripe_head *sh)
>> md_wakeup_thread(conf->mddev->thread);
>> }
>>
>> - if (!bio_list_empty(&s.return_bi)) {
>> - if (test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>> - spin_lock_irq(&conf->device_lock);
>> - bio_list_merge(&conf->return_bi, &s.return_bi);
>> - spin_unlock_irq(&conf->device_lock);
>> - md_wakeup_thread(conf->mddev->thread);
>> - } else
>> - return_io(&s.return_bi);
>> - }
>> + if (!bio_list_empty(&s.return_bi))
>> + return_io(&s.return_bi);
>>
>> clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
>> }
>> @@ -6225,6 +6219,7 @@ static void raid5_do_work(struct work_struct *work)
>> struct r5worker *worker = container_of(work, struct r5worker, work);
>> struct r5worker_group *group = worker->group;
>> struct r5conf *conf = group->conf;
>> + struct mddev *mddev = conf->mddev;
>> int group_id = group - conf->worker_groups;
>> int handled;
>> struct blk_plug plug;
>> @@ -6245,6 +6240,9 @@ static void raid5_do_work(struct work_struct *work)
>> if (!batch_size && !released)
>> break;
>> handled += batch_size;
>> + wait_event_lock_irq(mddev->sb_wait,
>> + !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags),
> MD_SB_CHANGE_PENDING?
Yes. Thanks for catching.
NeilBrown
>
>> + conf->device_lock);
>> }
>> pr_debug("%d stripes handled\n", handled);
>>
>> @@ -6272,18 +6270,6 @@ static void raid5d(struct md_thread *thread)
>>
>> md_check_recovery(mddev);
>>
>> - if (!bio_list_empty(&conf->return_bi) &&
>> - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
>> - struct bio_list tmp = BIO_EMPTY_LIST;
>> - spin_lock_irq(&conf->device_lock);
>> - if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
>> - bio_list_merge(&tmp, &conf->return_bi);
>> - bio_list_init(&conf->return_bi);
>> - }
>> - spin_unlock_irq(&conf->device_lock);
>> - return_io(&tmp);
>> - }
>> -
>> blk_start_plug(&plug);
>> handled = 0;
>> spin_lock_irq(&conf->device_lock);
>> @@ -6935,7 +6921,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>> INIT_LIST_HEAD(&conf->hold_list);
>> INIT_LIST_HEAD(&conf->delayed_list);
>> INIT_LIST_HEAD(&conf->bitmap_list);
>> - bio_list_init(&conf->return_bi);
>> init_llist_head(&conf->released_stripes);
>> atomic_set(&conf->active_stripes, 0);
>> atomic_set(&conf->preread_active_stripes, 0);
>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index ba5b7a3790af..13800dc9dd88 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -638,9 +638,6 @@ struct r5conf {
>> int skip_copy; /* Don't copy data from bio to stripe cache */
>> struct list_head *last_hold; /* detect hold_list promotions */
>>
>> - /* bios to have bi_end_io called after metadata is synced */
>> - struct bio_list return_bi;
>> -
>> atomic_t reshape_stripes; /* stripes with pending writes for reshape */
>> /* unfortunately we need two cache names as we temporarily have
>> * two caches.
>>
>>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [md PATCH 10/15] md/raid1: stop using bi_phys_segment
From: NeilBrown @ 2017-03-16 2:49 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170316001355.lyhq7lkyvndsocwo@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 2836 bytes --]
On Wed, Mar 15 2017, Shaohua Li wrote:
> On Wed, Mar 15, 2017 at 02:05:14PM +1100, Neil Brown wrote:
>> Change to use bio->__bi_remaining to count number of r1bio attached
>> to a bio.
>> See precious raid10 patch for more details.
>>
>> Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
>> used to measure different things, but were being compared.
>>
>> This patch fixes another bug in that nr_pending previously did not
>> could write-behind requests, so behind writes could continue while
>> resync was happening. How that nr_pending counts all r1_bio,
>> the resync cannot commence until the behind writes have completed.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> drivers/md/raid1.c | 87 +++++++++++++---------------------------------------
>> 1 file changed, 22 insertions(+), 65 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7e509a894f15..e566407b196f 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -246,35 +246,18 @@ static void reschedule_retry(struct r1bio *r1_bio)
>> static void call_bio_endio(struct r1bio *r1_bio)
>> {
>> struct bio *bio = r1_bio->master_bio;
>> - int done;
>> struct r1conf *conf = r1_bio->mddev->private;
>> sector_t bi_sector = bio->bi_iter.bi_sector;
>>
>> - if (bio->bi_phys_segments) {
>> - unsigned long flags;
>> - spin_lock_irqsave(&conf->device_lock, flags);
>> - bio->bi_phys_segments--;
>> - done = (bio->bi_phys_segments == 0);
>> - spin_unlock_irqrestore(&conf->device_lock, flags);
>> - /*
>> - * make_request() might be waiting for
>> - * bi_phys_segments to decrease
>> - */
>> - wake_up(&conf->wait_barrier);
>> - } else
>> - done = 1;
>> -
>> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>> bio->bi_error = -EIO;
>>
>> - if (done) {
>> - bio_endio(bio);
>> - /*
>> - * Wake up any possible resync thread that waits for the device
>> - * to go idle.
>> - */
>> - allow_barrier(conf, bi_sector);
>> - }
>> + bio_endio(bio);
>> + /*
>> + * Wake up any possible resync thread that waits for the device
>> + * to go idle.
>> + */
>> + allow_barrier(conf, bi_sector);
>
> I think this one should be r1_bio->sector instead of master_bio->sector,
> because multiple r1_bio could be attached to a master_bio. Maybe not change
> anything, because both sector should be in the same barrier unit, but we'd
> better to be consistent.
Yes, I agree. Both that it won't make a practical difference and that
it should be changed.
I just noticed another little problem with this patch.
The chunk in handle_read_error() should have added inc_pending()
near where it added bio_inc_remaining().
Shall I just resend the individual patch (and the raid5 one?).
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [md PATCH 15/15] MD: use per-cpu counter for writes_pending
From: NeilBrown @ 2017-03-16 2:57 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <20170316010557.bjccpzke4gta6tfr@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 7899 bytes --]
On Wed, Mar 15 2017, Shaohua Li wrote:
> On Wed, Mar 15, 2017 at 02:05:14PM +1100, Neil Brown wrote:
>> The 'writes_pending' counter is used to determine when the
>> array is stable so that it can be marked in the superblock
>> as "Clean". Consequently it needs to be updated frequently
>> but only checked for zero occasionally. Recent changes to
>> raid5 cause the count to be updated even more often - once
>> per 4K rather than once per bio. This provided
>> justification for making the updates more efficient.
>>
>> So we replace the atomic counter a percpu-refcount.
>> This can be incremented and decremented cheaply most of the
>> time, and can be switched to "atomic" mode when more
>> precise counting is needed. As it is possible for multiple
>> threads to want a precise count, we introduce a
>> "sync_checker" counter to count the number of threads
>> in "set_in_sync()", and only switch the refcount back
>> to percpu mode when that is zero.
>>
>> We need to be careful about races between set_in_sync()
>> setting ->in_sync to 1, and md_write_start() setting it
>> to zero. md_write_start() holds the rcu_read_lock()
>> while checking if the refcount is in percpu mode. If
>> it is, then we know a switch to 'atomic' will not happen until
>> after we call rcu_read_unlock(), in which case set_in_sync()
>> will see the elevated count, and not set in_sync to 1.
>> If it is not in percpu mode, we take the mddev->lock to
>> ensure proper synchronization.
>>
>> It is no longer possible to quickly check if the count is zero, which
>> we previously did to update a timer or to schedule the md_thread.
>> So now we do these every time we decrement that counter, but make
>> sure they are fast.
>>
>> mod_timer() already optimizes the case where the timeout value doesn't
>> actually change. We leverage that further by always rounding off the
>> jiffies to the timeout value. This may delay the marking of 'clean'
>> slightly, but ensure we only perform atomic operation here when absolutely
>> needed.
>>
>> md_wakeup_thread() current always calls wake_up(), even if
>> THREAD_WAKEUP is already set. That too can be optimised to avoid
>> calls to wake_up().
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> drivers/md/md.c | 70 +++++++++++++++++++++++++++++++++++++------------------
>> drivers/md/md.h | 3 ++
>> 2 files changed, 49 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index c33ec97b23d4..adf2b5bdfd67 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -65,6 +65,8 @@
>> #include <linux/raid/md_p.h>
>> #include <linux/raid/md_u.h>
>> #include <linux/slab.h>
>> +#include <linux/percpu-refcount.h>
>> +
>> #include <trace/events/block.h>
>> #include "md.h"
>> #include "bitmap.h"
>> @@ -2255,16 +2257,24 @@ static void export_array(struct mddev *mddev)
>> static bool set_in_sync(struct mddev *mddev)
>> {
>> WARN_ON_ONCE(!spin_is_locked(&mddev->lock));
>> - if (atomic_read(&mddev->writes_pending) == 0) {
>> - if (mddev->in_sync == 0) {
>> + if (!mddev->in_sync) {
>> + mddev->sync_checkers++;
>> + spin_unlock(&mddev->lock);
>> + percpu_ref_switch_to_atomic_sync(&mddev->writes_pending);
>> + spin_lock(&mddev->lock);
>> + if (!mddev->in_sync &&
>> + percpu_ref_is_zero(&mddev->writes_pending)) {
>> mddev->in_sync = 1;
>> + /*
>> + * Ensure in_sync is visible before switch back
>> + * to percpu
>> + */
>> smp_mb();
>> - if (atomic_read(&mddev->writes_pending))
>> - /* lost a race with md_write_start() */
>> - mddev->in_sync = 0;
>> set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>> sysfs_notify_dirent_safe(mddev->sysfs_state);
>> }
>> + if (--mddev->sync_checkers == 0)
>> + percpu_ref_switch_to_percpu(&mddev->writes_pending);
>
> percpu_ref_switch_to_percpu could sleep. Maybe let set_in_sync return a value
> to indicate if caller should drop lock and then do the switch.
percpu_ref_switch_to_percpu() documentation says:
* This function may block if @ref is in the process of switching to atomic
* mode. If the caller ensures that @ref is not in the process of
* switching to atomic mode, this function can be called from any context.
The percpu_ref_switch_to_atomic_sync() will have ensured that a switch
to atomic mode completed, and the use of ->sync_checkers will ensure
that no other mode change has happened. So according to the
documentation, "this function can be called from any context".
Convinced?
>
>> }
>> if (mddev->safemode == 1)
>> mddev->safemode = 0;
>> @@ -5120,6 +5130,7 @@ static void md_free(struct kobject *ko)
>> del_gendisk(mddev->gendisk);
>> put_disk(mddev->gendisk);
>> }
>> + percpu_ref_exit(&mddev->writes_pending);
>>
>> kfree(mddev);
>> }
>> @@ -5145,6 +5156,8 @@ static void mddev_delayed_delete(struct work_struct *ws)
>> kobject_put(&mddev->kobj);
>> }
>>
>> +static void no_op(struct percpu_ref *r) {}
>> +
>> static int md_alloc(dev_t dev, char *name)
>> {
>> static DEFINE_MUTEX(disks_mutex);
>> @@ -5196,6 +5209,10 @@ static int md_alloc(dev_t dev, char *name)
>> blk_queue_make_request(mddev->queue, md_make_request);
>> blk_set_stacking_limits(&mddev->queue->limits);
>>
>> + if (percpu_ref_init(&mddev->writes_pending, no_op, 0, GFP_KERNEL) < 0)
>> + goto abort;
>> + /* We want to start with the refcount at zero */
>> + percpu_ref_put(&mddev->writes_pending);
>> disk = alloc_disk(1 << shift);
>> if (!disk) {
>> blk_cleanup_queue(mddev->queue);
>> @@ -5279,11 +5296,10 @@ static void md_safemode_timeout(unsigned long data)
>> {
>> struct mddev *mddev = (struct mddev *) data;
>>
>> - if (!atomic_read(&mddev->writes_pending)) {
>> - mddev->safemode = 1;
>> - if (mddev->external)
>> - sysfs_notify_dirent_safe(mddev->sysfs_state);
>> - }
>> + mddev->safemode = 1;
>> + if (mddev->external)
>> + sysfs_notify_dirent_safe(mddev->sysfs_state);
>> +
>> md_wakeup_thread(mddev->thread);
>> }
>>
>> @@ -5488,7 +5504,6 @@ int md_run(struct mddev *mddev)
>> } else if (mddev->ro == 2) /* auto-readonly not meaningful */
>> mddev->ro = 0;
>>
>> - atomic_set(&mddev->writes_pending,0);
>> atomic_set(&mddev->max_corr_read_errors,
>> MD_DEFAULT_MAX_CORRECTED_READ_ERRORS);
>> mddev->safemode = 0;
>> @@ -7351,8 +7366,8 @@ void md_wakeup_thread(struct md_thread *thread)
>> {
>> if (thread) {
>> pr_debug("md: waking up MD thread %s.\n", thread->tsk->comm);
>> - set_bit(THREAD_WAKEUP, &thread->flags);
>> - wake_up(&thread->wqueue);
>> + if (!test_and_set_bit(THREAD_WAKEUP, &thread->flags))
>> + wake_up(&thread->wqueue);
>> }
>> }
>> EXPORT_SYMBOL(md_wakeup_thread);
>> @@ -7886,6 +7901,7 @@ EXPORT_SYMBOL(md_done_sync);
>> */
>> void md_write_start(struct mddev *mddev, struct bio *bi)
>> {
>> + unsigned long __percpu *notused;
>> int did_change = 0;
>> if (bio_data_dir(bi) != WRITE)
>> return;
>> @@ -7899,11 +7915,12 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
>> md_wakeup_thread(mddev->sync_thread);
>> did_change = 1;
>> }
>> - atomic_inc(&mddev->writes_pending);
>> + rcu_read_lock();
>> + percpu_ref_get(&mddev->writes_pending);
>> smp_mb(); /* Match smp_mb in set_in_sync() */
>> if (mddev->safemode == 1)
>> mddev->safemode = 0;
>> - if (mddev->in_sync) {
>> + if (mddev->in_sync || !__ref_is_percpu(&mddev->writes_pending, ¬used)) {
>
> this need review from percpu-ref guys. The api doc declares it's internal use only.
I might add a "ref_is_percpu()", which just takes one arg, to
percpu-reference.h and post that with the other percpu-ref changes.
Thanks for the review.
NeilBrown
>
>
> Thanks,
> Shaohua
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [md PATCH 10/15] md/raid1: stop using bi_phys_segment
From: Shaohua Li @ 2017-03-16 3:36 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, hch
In-Reply-To: <8737eedjfu.fsf@notabene.neil.brown.name>
On Thu, Mar 16, 2017 at 01:49:57PM +1100, Neil Brown wrote:
> On Wed, Mar 15 2017, Shaohua Li wrote:
>
> > On Wed, Mar 15, 2017 at 02:05:14PM +1100, Neil Brown wrote:
> >> Change to use bio->__bi_remaining to count number of r1bio attached
> >> to a bio.
> >> See precious raid10 patch for more details.
> >>
> >> Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
> >> used to measure different things, but were being compared.
> >>
> >> This patch fixes another bug in that nr_pending previously did not
> >> could write-behind requests, so behind writes could continue while
> >> resync was happening. How that nr_pending counts all r1_bio,
> >> the resync cannot commence until the behind writes have completed.
> >>
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >> drivers/md/raid1.c | 87 +++++++++++++---------------------------------------
> >> 1 file changed, 22 insertions(+), 65 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index 7e509a894f15..e566407b196f 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -246,35 +246,18 @@ static void reschedule_retry(struct r1bio *r1_bio)
> >> static void call_bio_endio(struct r1bio *r1_bio)
> >> {
> >> struct bio *bio = r1_bio->master_bio;
> >> - int done;
> >> struct r1conf *conf = r1_bio->mddev->private;
> >> sector_t bi_sector = bio->bi_iter.bi_sector;
> >>
> >> - if (bio->bi_phys_segments) {
> >> - unsigned long flags;
> >> - spin_lock_irqsave(&conf->device_lock, flags);
> >> - bio->bi_phys_segments--;
> >> - done = (bio->bi_phys_segments == 0);
> >> - spin_unlock_irqrestore(&conf->device_lock, flags);
> >> - /*
> >> - * make_request() might be waiting for
> >> - * bi_phys_segments to decrease
> >> - */
> >> - wake_up(&conf->wait_barrier);
> >> - } else
> >> - done = 1;
> >> -
> >> if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> >> bio->bi_error = -EIO;
> >>
> >> - if (done) {
> >> - bio_endio(bio);
> >> - /*
> >> - * Wake up any possible resync thread that waits for the device
> >> - * to go idle.
> >> - */
> >> - allow_barrier(conf, bi_sector);
> >> - }
> >> + bio_endio(bio);
> >> + /*
> >> + * Wake up any possible resync thread that waits for the device
> >> + * to go idle.
> >> + */
> >> + allow_barrier(conf, bi_sector);
> >
> > I think this one should be r1_bio->sector instead of master_bio->sector,
> > because multiple r1_bio could be attached to a master_bio. Maybe not change
> > anything, because both sector should be in the same barrier unit, but we'd
> > better to be consistent.
>
> Yes, I agree. Both that it won't make a practical difference and that
> it should be changed.
> I just noticed another little problem with this patch.
> The chunk in handle_read_error() should have added inc_pending()
> near where it added bio_inc_remaining().
>
> Shall I just resend the individual patch (and the raid5 one?).
Please send a fix, I'll integrate it to original patches.
Thanks,
Shaohua
^ permalink raw reply
* [PATCH] md/bitmap:fix a typo in comments of bitmap_read_sb
From: Zhilong Liu @ 2017-03-16 4:26 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Zhilong Liu
In-Reply-To: <1489565693-12595-1-git-send-email-zlliu@suse.com>
bitmap.c: fix a trivial typo in comments of bitmap_read_sb().
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
drivers/md/bitmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2cca..af9e0f0 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -696,7 +696,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
out:
kunmap_atomic(sb);
- /* Assiging chunksize is required for "re_read" */
+ /* Assigning chunksize is required for "re_read" */
bitmap->mddev->bitmap_info.chunksize = chunksize;
if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
err = md_setup_cluster(bitmap->mddev, nodes);
--
2.6.6
^ permalink raw reply related
* [PATCH] md/raid5:fix typo in comments of resize_stripes
From: Zhilong Liu @ 2017-03-16 14:10 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Zhilong Liu
raid5.c: fix typo in comment of resize_stripes()
and delete repeated words.
Signed-off-by: Zhilong Liu <zlliu@suse.com>
---
drivers/md/raid5.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4fb09b3..cb55b3b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2209,7 +2209,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
* pages have been transferred over, and the old kmem_cache is
* freed when all stripes are done.
* 3/ reallocate conf->disks to be suitable bigger. If this fails,
- * we simple return a failre status - no need to clean anything up.
+ * we simple return a failure status - no need to clean anything up.
* 4/ allocate new pages for the new slots in the new stripe_heads.
* If this fails, we don't bother trying the shrink the
* stripe_heads down again, we just leave them as they are.
@@ -3448,7 +3448,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
!test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
/* Pre-reads at not permitted until after short delay
* to gather multiple requests. However if this
- * device is no Insync, the block could only be be computed
+ * device is no Insync, the block could only be computed
* and there is no need to delay that.
*/
return 0;
@@ -3467,7 +3467,7 @@ static int need_this_block(struct stripe_head *sh, struct stripe_head_state *s,
/* If we are forced to do a reconstruct-write, either because
* the current RAID6 implementation only supports that, or
- * or because parity cannot be trusted and we are currently
+ * because parity cannot be trusted and we are currently
* recovering it, there is extra need to be careful.
* If one of the devices that we would need to read, because
* it is not being overwritten (and maybe not written at all)
--
2.6.6
^ permalink raw reply related
* [PATCH v3 00/14] md: cleanup on direct access to bvec table
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In MD's resync I/O path, there are lots of direct access to bio's
bvec table. This patchset kills almost all, and the conversion
is quite straightforward. One root cause of direct access to bvec
table is that resync I/O uses the bio's bvec to manage pages.
In V1, as suggested by Shaohua, a new approach is used to manage
these pages for resync I/O, turns out code becomes more clean
and readable.
Once direct access to bvec table in MD is cleaned up, we may make
multipage bvec moving on.
V3:
- improve pages allocation & bio clone for write behind(10/14)
- retrieve pages from preallocated page array, and avoid to user
bio helpers(14/14)
- change on helpers about managing resync io pages(03/14)
V2:
- remove the patch for introducing/applying bio_remove_last_page()
V1:
- allocate page array to manage resync pages
Thanks,
Ming
Ming Lei (14):
md: raid1/raid10: don't handle failure of bio_add_page()
md: move two macros into md.h
md: prepare for managing resync I/O pages in clean way
md: raid1: simplify r1buf_pool_free()
md: raid1: don't use bio's vec table to manage resync pages
md: raid1: retrieve page from pre-allocated resync page array
md: raid1: use bio helper in process_checks()
block: introduce bio_copy_data_partial
md: raid1: move 'offset' out of loop
md: raid1: improve write behind
md: raid10: refactor code of read reshape's .bi_end_io
md: raid10: don't use bio's vec table to manage resync pages
md: raid10: retrieve page from preallocated resync page array
md: raid10: avoid direct access to bvec table in
handle_reshape_read_error
block/bio.c | 60 +++++++++---
drivers/md/md.h | 55 +++++++++++
drivers/md/raid1.c | 267 ++++++++++++++++++++++++++++------------------------
drivers/md/raid1.h | 10 +-
drivers/md/raid10.c | 224 +++++++++++++++++++++++--------------------
include/linux/bio.h | 2 +
6 files changed, 376 insertions(+), 242 deletions(-)
--
2.9.3
^ permalink raw reply
* [PATCH v3 01/14] md: raid1/raid10: don't handle failure of bio_add_page()
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
All bio_add_page() is for adding one page into resync bio,
which is big enough to hold RESYNC_PAGES pages, and
the current bio_add_page() doesn't check queue limit any more,
so it won't fail at all.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 21 ++++++---------------
drivers/md/raid10.c | 41 ++++++++++-------------------------------
2 files changed, 16 insertions(+), 46 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fbc2d7851b49..4a0b2ad5025e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2917,21 +2917,12 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
bio = r1_bio->bios[i];
if (bio->bi_end_io) {
page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
- if (bio_add_page(bio, page, len, 0) == 0) {
- /* stop here */
- bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
- while (i > 0) {
- i--;
- bio = r1_bio->bios[i];
- if (bio->bi_end_io==NULL)
- continue;
- /* remove last page from this bio */
- bio->bi_vcnt--;
- bio->bi_iter.bi_size -= len;
- bio_clear_flag(bio, BIO_SEG_VALID);
- }
- goto bio_full;
- }
+
+ /*
+ * won't fail because the vec table is big
+ * enough to hold all these pages
+ */
+ bio_add_page(bio, page, len, 0);
}
}
nr_sectors += len>>9;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0536658c9d40..b4a56a488668 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3437,27 +3437,16 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
if (len == 0)
break;
for (bio= biolist ; bio ; bio=bio->bi_next) {
- struct bio *bio2;
page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
- if (bio_add_page(bio, page, len, 0))
- continue;
-
- /* stop here */
- bio->bi_io_vec[bio->bi_vcnt].bv_page = page;
- for (bio2 = biolist;
- bio2 && bio2 != bio;
- bio2 = bio2->bi_next) {
- /* remove last page from this bio */
- bio2->bi_vcnt--;
- bio2->bi_iter.bi_size -= len;
- bio_clear_flag(bio2, BIO_SEG_VALID);
- }
- goto bio_full;
+ /*
+ * won't fail because the vec table is big enough
+ * to hold all these pages
+ */
+ bio_add_page(bio, page, len, 0);
}
nr_sectors += len>>9;
sector_nr += len>>9;
} while (biolist->bi_vcnt < RESYNC_PAGES);
- bio_full:
r10_bio->sectors = nr_sectors;
while (biolist) {
@@ -4530,25 +4519,15 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
if (len > PAGE_SIZE)
len = PAGE_SIZE;
for (bio = blist; bio ; bio = bio->bi_next) {
- struct bio *bio2;
- if (bio_add_page(bio, page, len, 0))
- continue;
-
- /* Didn't fit, must stop */
- for (bio2 = blist;
- bio2 && bio2 != bio;
- bio2 = bio2->bi_next) {
- /* Remove last page from this bio */
- bio2->bi_vcnt--;
- bio2->bi_iter.bi_size -= len;
- bio_clear_flag(bio2, BIO_SEG_VALID);
- }
- goto bio_full;
+ /*
+ * won't fail because the vec table is big enough
+ * to hold all these pages
+ */
+ bio_add_page(bio, page, len, 0);
}
sector_nr += len >> 9;
nr_sectors += len >> 9;
}
-bio_full:
rcu_read_unlock();
r10_bio->sectors = nr_sectors;
--
2.9.3
^ permalink raw reply related
* [PATCH v3 02/14] md: move two macros into md.h
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/md.h | 5 +++++
drivers/md/raid1.c | 2 --
drivers/md/raid10.c | 3 ---
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8859cbf84b6..1d63239a1be4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -715,4 +715,9 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
!bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
mddev->queue->limits.max_write_same_sectors = 0;
}
+
+/* Maximum size of each resync request */
+#define RESYNC_BLOCK_SIZE (64*1024)
+#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
+
#endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4a0b2ad5025e..908e2caeb704 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -94,10 +94,8 @@ static void r1bio_pool_free(void *r1_bio, void *data)
kfree(r1_bio);
}
-#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_DEPTH 32
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)
-#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
#define RESYNC_WINDOW (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
#define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
#define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b4a56a488668..2b40420299e3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -125,9 +125,6 @@ static void r10bio_pool_free(void *r10_bio, void *data)
kfree(r10_bio);
}
-/* Maximum size of each resync request */
-#define RESYNC_BLOCK_SIZE (64*1024)
-#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
/* amount of memory to reserve for resync requests */
#define RESYNC_WINDOW (1024*1024)
/* maximum number of concurrent requests, memory permitting */
--
2.9.3
^ permalink raw reply related
* [PATCH v3 03/14] md: prepare for managing resync I/O pages in clean way
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.
So introduce helpers and new data structure for
managing resync I/O pages more cleanly.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/md.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1d63239a1be4..20c48032493b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -720,4 +720,54 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
+/* for managing resync I/O pages */
+struct resync_pages {
+ unsigned idx; /* for get/put page from the pool */
+ void *raid_bio;
+ struct page *pages[RESYNC_PAGES];
+};
+
+static inline int resync_alloc_pages(struct resync_pages *rp,
+ gfp_t gfp_flags)
+{
+ int i;
+
+ for (i = 0; i < RESYNC_PAGES; i++) {
+ rp->pages[i] = alloc_page(gfp_flags);
+ if (!rp->pages[i])
+ goto out_free;
+ }
+
+ return 0;
+
+ out_free:
+ while (--i >= 0)
+ put_page(rp->pages[i]);
+ return -ENOMEM;
+}
+
+static inline void resync_free_pages(struct resync_pages *rp)
+{
+ int i;
+
+ for (i = 0; i < RESYNC_PAGES; i++)
+ put_page(rp->pages[i]);
+}
+
+static inline void resync_get_all_pages(struct resync_pages *rp)
+{
+ int i;
+
+ for (i = 0; i < RESYNC_PAGES; i++)
+ get_page(rp->pages[i]);
+}
+
+static inline struct page *resync_fetch_page(struct resync_pages *rp,
+ unsigned idx)
+{
+ if (WARN_ON_ONCE(idx >= RESYNC_PAGES))
+ return NULL;
+ return rp->pages[idx];
+}
+
#endif /* _MD_MD_H */
--
2.9.3
^ permalink raw reply related
* [PATCH v3 04/14] md: raid1: simplify r1buf_pool_free()
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
This patch gets each page's reference of each bio for resync,
then r1buf_pool_free() gets simplified a lot.
The same policy has been taken in raid10's buf pool allocation/free
too.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 908e2caeb704..e30d89690109 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -142,9 +142,12 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
/* If not user-requests, copy the page pointers to all bios */
if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
for (i=0; i<RESYNC_PAGES ; i++)
- for (j=1; j<pi->raid_disks; j++)
- r1_bio->bios[j]->bi_io_vec[i].bv_page =
+ for (j=1; j<pi->raid_disks; j++) {
+ struct page *page =
r1_bio->bios[0]->bi_io_vec[i].bv_page;
+ get_page(page);
+ r1_bio->bios[j]->bi_io_vec[i].bv_page = page;
+ }
}
r1_bio->master_bio = NULL;
@@ -169,12 +172,8 @@ static void r1buf_pool_free(void *__r1_bio, void *data)
struct r1bio *r1bio = __r1_bio;
for (i = 0; i < RESYNC_PAGES; i++)
- for (j = pi->raid_disks; j-- ;) {
- if (j == 0 ||
- r1bio->bios[j]->bi_io_vec[i].bv_page !=
- r1bio->bios[0]->bi_io_vec[i].bv_page)
- safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
- }
+ for (j = pi->raid_disks; j-- ;)
+ safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
for (i=0 ; i < pi->raid_disks; i++)
bio_put(r1bio->bios[i]);
--
2.9.3
^ permalink raw reply related
* [PATCH v3 05/14] md: raid1: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Now we allocate one page array for managing resync pages, instead
of using bio's vec table to do that, and the old way is very hacky
and won't work any more if multipage bvec is enabled.
The introduced cost is that we need to allocate (128 + 16) * raid_disks
bytes per r1_bio, and it is fine because the inflight r1_bio for
resync shouldn't be much, as pointed by Shaohua.
Also the bio_reset() in raid1_sync_request() is removed because
all bios are freshly new now and not necessary to reset any more.
This patch can be thought as a cleanup too
Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 94 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 64 insertions(+), 30 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e30d89690109..0e64beb60e4d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -80,6 +80,24 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
#define raid1_log(md, fmt, args...) \
do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
+/*
+ * 'strct resync_pages' stores actual pages used for doing the resync
+ * IO, and it is per-bio, so make .bi_private points to it.
+ */
+static inline struct resync_pages *get_resync_pages(struct bio *bio)
+{
+ return bio->bi_private;
+}
+
+/*
+ * for resync bio, r1bio pointer can be retrieved from the per-bio
+ * 'struct resync_pages'.
+ */
+static inline struct r1bio *get_resync_r1bio(struct bio *bio)
+{
+ return get_resync_pages(bio)->raid_bio;
+}
+
static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
{
struct pool_info *pi = data;
@@ -107,12 +125,18 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
struct r1bio *r1_bio;
struct bio *bio;
int need_pages;
- int i, j;
+ int j;
+ struct resync_pages *rps;
r1_bio = r1bio_pool_alloc(gfp_flags, pi);
if (!r1_bio)
return NULL;
+ rps = kmalloc(sizeof(struct resync_pages) * pi->raid_disks,
+ gfp_flags);
+ if (!rps)
+ goto out_free_r1bio;
+
/*
* Allocate bios : 1 for reading, n-1 for writing
*/
@@ -132,22 +156,22 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
need_pages = pi->raid_disks;
else
need_pages = 1;
- for (j = 0; j < need_pages; j++) {
+ for (j = 0; j < pi->raid_disks; j++) {
+ struct resync_pages *rp = &rps[j];
+
bio = r1_bio->bios[j];
- bio->bi_vcnt = RESYNC_PAGES;
-
- if (bio_alloc_pages(bio, gfp_flags))
- goto out_free_pages;
- }
- /* If not user-requests, copy the page pointers to all bios */
- if (!test_bit(MD_RECOVERY_REQUESTED, &pi->mddev->recovery)) {
- for (i=0; i<RESYNC_PAGES ; i++)
- for (j=1; j<pi->raid_disks; j++) {
- struct page *page =
- r1_bio->bios[0]->bi_io_vec[i].bv_page;
- get_page(page);
- r1_bio->bios[j]->bi_io_vec[i].bv_page = page;
- }
+
+ if (j < need_pages) {
+ if (resync_alloc_pages(rp, gfp_flags))
+ goto out_free_pages;
+ } else {
+ memcpy(rp, &rps[0], sizeof(*rp));
+ resync_get_all_pages(rp);
+ }
+
+ rp->idx = 0;
+ rp->raid_bio = r1_bio;
+ bio->bi_private = rp;
}
r1_bio->master_bio = NULL;
@@ -156,11 +180,14 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
out_free_pages:
while (--j >= 0)
- bio_free_pages(r1_bio->bios[j]);
+ resync_free_pages(&rps[j]);
out_free_bio:
while (++j < pi->raid_disks)
bio_put(r1_bio->bios[j]);
+ kfree(rps);
+
+out_free_r1bio:
r1bio_pool_free(r1_bio, data);
return NULL;
}
@@ -168,14 +195,18 @@ static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
static void r1buf_pool_free(void *__r1_bio, void *data)
{
struct pool_info *pi = data;
- int i,j;
+ int i;
struct r1bio *r1bio = __r1_bio;
+ struct resync_pages *rp = NULL;
- for (i = 0; i < RESYNC_PAGES; i++)
- for (j = pi->raid_disks; j-- ;)
- safe_put_page(r1bio->bios[j]->bi_io_vec[i].bv_page);
- for (i=0 ; i < pi->raid_disks; i++)
+ for (i = pi->raid_disks; i--; ) {
+ rp = get_resync_pages(r1bio->bios[i]);
+ resync_free_pages(rp);
bio_put(r1bio->bios[i]);
+ }
+
+ /* resync pages array stored in the 1st bio's .bi_private */
+ kfree(rp);
r1bio_pool_free(r1bio, data);
}
@@ -1863,7 +1894,7 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
static void end_sync_read(struct bio *bio)
{
- struct r1bio *r1_bio = bio->bi_private;
+ struct r1bio *r1_bio = get_resync_r1bio(bio);
update_head_pos(r1_bio->read_disk, r1_bio);
@@ -1882,7 +1913,7 @@ static void end_sync_read(struct bio *bio)
static void end_sync_write(struct bio *bio)
{
int uptodate = !bio->bi_error;
- struct r1bio *r1_bio = bio->bi_private;
+ struct r1bio *r1_bio = get_resync_r1bio(bio);
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
sector_t first_bad;
@@ -2099,6 +2130,7 @@ static void process_checks(struct r1bio *r1_bio)
int size;
int error;
struct bio *b = r1_bio->bios[i];
+ struct resync_pages *rp = get_resync_pages(b);
if (b->bi_end_io != end_sync_read)
continue;
/* fixup the bio for reuse, but preserve errno */
@@ -2111,7 +2143,8 @@ static void process_checks(struct r1bio *r1_bio)
conf->mirrors[i].rdev->data_offset;
b->bi_bdev = conf->mirrors[i].rdev->bdev;
b->bi_end_io = end_sync_read;
- b->bi_private = r1_bio;
+ rp->raid_bio = r1_bio;
+ b->bi_private = rp;
size = b->bi_iter.bi_size;
for (j = 0; j < vcnt ; j++) {
@@ -2769,7 +2802,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
for (i = 0; i < conf->raid_disks * 2; i++) {
struct md_rdev *rdev;
bio = r1_bio->bios[i];
- bio_reset(bio);
rdev = rcu_dereference(conf->mirrors[i].rdev);
if (rdev == NULL ||
@@ -2825,7 +2857,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
atomic_inc(&rdev->nr_pending);
bio->bi_iter.bi_sector = sector_nr + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
- bio->bi_private = r1_bio;
if (test_bit(FailFast, &rdev->flags))
bio->bi_opf |= MD_FAILFAST;
}
@@ -2911,9 +2942,12 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
}
for (i = 0 ; i < conf->raid_disks * 2; i++) {
+ struct resync_pages *rp;
+
bio = r1_bio->bios[i];
+ rp = get_resync_pages(bio);
if (bio->bi_end_io) {
- page = bio->bi_io_vec[bio->bi_vcnt].bv_page;
+ page = resync_fetch_page(rp, rp->idx++);
/*
* won't fail because the vec table is big
@@ -2925,8 +2959,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
nr_sectors += len>>9;
sector_nr += len>>9;
sync_blocks -= (len>>9);
- } while (r1_bio->bios[disk]->bi_vcnt < RESYNC_PAGES);
- bio_full:
+ } while (get_resync_pages(r1_bio->bios[disk]->bi_private)->idx < RESYNC_PAGES);
+
r1_bio->sectors = nr_sectors;
if (mddev_is_clustered(mddev) &&
--
2.9.3
^ permalink raw reply related
* [PATCH v3 06/14] md: raid1: retrieve page from pre-allocated resync page array
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Now one page array is allocated for each resync bio, and we can
retrieve page from this table directly.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0e64beb60e4d..b1345aa4efd8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1992,6 +1992,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
struct mddev *mddev = r1_bio->mddev;
struct r1conf *conf = mddev->private;
struct bio *bio = r1_bio->bios[r1_bio->read_disk];
+ struct page **pages = get_resync_pages(bio)->pages;
sector_t sect = r1_bio->sector;
int sectors = r1_bio->sectors;
int idx = 0;
@@ -2025,7 +2026,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
*/
rdev = conf->mirrors[d].rdev;
if (sync_page_io(rdev, sect, s<<9,
- bio->bi_io_vec[idx].bv_page,
+ pages[idx],
REQ_OP_READ, 0, false)) {
success = 1;
break;
@@ -2080,7 +2081,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
- bio->bi_io_vec[idx].bv_page,
+ pages[idx],
WRITE) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
@@ -2095,7 +2096,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
continue;
rdev = conf->mirrors[d].rdev;
if (r1_sync_page_io(rdev, sect, s,
- bio->bi_io_vec[idx].bv_page,
+ pages[idx],
READ) != 0)
atomic_add(s, &rdev->corrected_errors);
}
@@ -2171,6 +2172,8 @@ static void process_checks(struct r1bio *r1_bio)
struct bio *pbio = r1_bio->bios[primary];
struct bio *sbio = r1_bio->bios[i];
int error = sbio->bi_error;
+ struct page **ppages = get_resync_pages(pbio)->pages;
+ struct page **spages = get_resync_pages(sbio)->pages;
if (sbio->bi_end_io != end_sync_read)
continue;
@@ -2179,11 +2182,8 @@ static void process_checks(struct r1bio *r1_bio)
if (!error) {
for (j = vcnt; j-- ; ) {
- struct page *p, *s;
- p = pbio->bi_io_vec[j].bv_page;
- s = sbio->bi_io_vec[j].bv_page;
- if (memcmp(page_address(p),
- page_address(s),
+ if (memcmp(page_address(ppages[j]),
+ page_address(spages[j]),
sbio->bi_io_vec[j].bv_len))
break;
}
--
2.9.3
^ permalink raw reply related
* [PATCH v3 07/14] md: raid1: use bio helper in process_checks()
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Avoid to direct access to bvec table.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b1345aa4efd8..4034a2963da8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2130,6 +2130,7 @@ static void process_checks(struct r1bio *r1_bio)
int j;
int size;
int error;
+ struct bio_vec *bi;
struct bio *b = r1_bio->bios[i];
struct resync_pages *rp = get_resync_pages(b);
if (b->bi_end_io != end_sync_read)
@@ -2148,9 +2149,7 @@ static void process_checks(struct r1bio *r1_bio)
b->bi_private = rp;
size = b->bi_iter.bi_size;
- for (j = 0; j < vcnt ; j++) {
- struct bio_vec *bi;
- bi = &b->bi_io_vec[j];
+ bio_for_each_segment_all(bi, b, j) {
bi->bv_offset = 0;
if (size > PAGE_SIZE)
bi->bv_len = PAGE_SIZE;
@@ -2174,17 +2173,22 @@ static void process_checks(struct r1bio *r1_bio)
int error = sbio->bi_error;
struct page **ppages = get_resync_pages(pbio)->pages;
struct page **spages = get_resync_pages(sbio)->pages;
+ struct bio_vec *bi;
+ int page_len[RESYNC_PAGES];
if (sbio->bi_end_io != end_sync_read)
continue;
/* Now we can 'fixup' the error value */
sbio->bi_error = 0;
+ bio_for_each_segment_all(bi, sbio, j)
+ page_len[j] = bi->bv_len;
+
if (!error) {
for (j = vcnt; j-- ; ) {
if (memcmp(page_address(ppages[j]),
page_address(spages[j]),
- sbio->bi_io_vec[j].bv_len))
+ page_len[j]))
break;
}
} else
--
2.9.3
^ permalink raw reply related
* [PATCH v3 08/14] block: introduce bio_copy_data_partial
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
Turns out we can use bio_copy_data in raid1's write behind,
and we can make alloc_behind_pages() more clean/efficient,
but we need to partial version of bio_copy_data().
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
block/bio.c | 60 +++++++++++++++++++++++++++++++++++++++++------------
include/linux/bio.h | 2 ++
2 files changed, 49 insertions(+), 13 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index e75878f8b14a..1ccff0dace89 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1025,19 +1025,8 @@ int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
}
EXPORT_SYMBOL(bio_alloc_pages);
-/**
- * bio_copy_data - copy contents of data buffers from one chain of bios to
- * another
- * @src: source bio list
- * @dst: destination bio list
- *
- * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
- * @src and @dst as linked lists of bios.
- *
- * Stops when it reaches the end of either @src or @dst - that is, copies
- * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
- */
-void bio_copy_data(struct bio *dst, struct bio *src)
+static void __bio_copy_data(struct bio *dst, struct bio *src,
+ int offset, int size)
{
struct bvec_iter src_iter, dst_iter;
struct bio_vec src_bv, dst_bv;
@@ -1047,6 +1036,12 @@ void bio_copy_data(struct bio *dst, struct bio *src)
src_iter = src->bi_iter;
dst_iter = dst->bi_iter;
+ /* for supporting partial copy */
+ if (offset || size != src->bi_iter.bi_size) {
+ bio_advance_iter(src, &src_iter, offset);
+ src_iter.bi_size = size;
+ }
+
while (1) {
if (!src_iter.bi_size) {
src = src->bi_next;
@@ -1083,8 +1078,47 @@ void bio_copy_data(struct bio *dst, struct bio *src)
bio_advance_iter(dst, &dst_iter, bytes);
}
}
+
+/**
+ * bio_copy_data - copy contents of data buffers from one chain of bios to
+ * another
+ * @src: source bio list
+ * @dst: destination bio list
+ *
+ * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
+ * @src and @dst as linked lists of bios.
+ *
+ * Stops when it reaches the end of either @src or @dst - that is, copies
+ * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
+ */
+void bio_copy_data(struct bio *dst, struct bio *src)
+{
+ __bio_copy_data(dst, src, 0, src->bi_iter.bi_size);
+}
EXPORT_SYMBOL(bio_copy_data);
+/**
+ * bio_copy_data_partial - copy partial contents of data buffers from one
+ * chain of bios to another
+ * @dst: destination bio list
+ * @src: source bio list
+ * @offset: starting copy from the offset
+ * @size: how many bytes to copy
+ *
+ * If @src and @dst are single bios, bi_next must be NULL - otherwise, treats
+ * @src and @dst as linked lists of bios.
+ *
+ * Stops when it reaches the end of either @src or @dst - that is, copies
+ * min(src->bi_size, dst->bi_size) bytes (or the equivalent for lists of bios).
+ */
+void bio_copy_data_partial(struct bio *dst, struct bio *src,
+ int offset, int size)
+{
+ __bio_copy_data(dst, src, offset, size);
+
+}
+EXPORT_SYMBOL(bio_copy_data_partial);
+
struct bio_map_data {
int is_our_pages;
struct iov_iter iter;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..42b62a0288b0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -468,6 +468,8 @@ static inline void bio_flush_dcache_pages(struct bio *bi)
#endif
extern void bio_copy_data(struct bio *dst, struct bio *src);
+extern void bio_copy_data_partial(struct bio *dst, struct bio *src,
+ int offset, int size);
extern int bio_alloc_pages(struct bio *bio, gfp_t gfp);
extern void bio_free_pages(struct bio *bio);
--
2.9.3
^ permalink raw reply related
* [PATCH v3 09/14] md: raid1: move 'offset' out of loop
From: Ming Lei @ 2017-03-16 16:12 UTC (permalink / raw)
To: Shaohua Li, Jens Axboe, linux-raid, linux-block,
Christoph Hellwig; +Cc: Ming Lei
In-Reply-To: <20170316161235.27110-1-tom.leiming@gmail.com>
The 'offset' local variable can't be changed inside the loop, so
move it out.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
drivers/md/raid1.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4034a2963da8..2f3622c695ce 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1317,6 +1317,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
int first_clone;
int sectors_handled;
int max_sectors;
+ sector_t offset;
/*
* Register the new request and wait if the reconstruction
@@ -1481,13 +1482,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio)
atomic_set(&r1_bio->behind_remaining, 0);
first_clone = 1;
+
+ offset = r1_bio->sector - bio->bi_iter.bi_sector;
for (i = 0; i < disks; i++) {
struct bio *mbio = NULL;
- sector_t offset;
if (!r1_bio->bios[i])
continue;
- offset = r1_bio->sector - bio->bi_iter.bi_sector;
if (first_clone) {
/* do behind I/O ?
--
2.9.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox