From: Shaohua Li <shli@kernel.org>
To: NeilBrown <neilb@suse.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: [PATCH] md: use a separate bio_set for synchronous IO.
Date: Wed, 21 Jun 2017 10:49:16 -0700 [thread overview]
Message-ID: <20170621174916.mpxddat2vbiolrik@kernel.org> (raw)
In-Reply-To: <87wp861cfe.fsf@notabene.neil.brown.name>
On Wed, Jun 21, 2017 at 09:12:21AM +1000, Neil Brown wrote:
>
> md devices allocate a bio_set and use it for two
> distinct purposes.
> mddev->bio_set is used to clone bios as part of sending
> upper level requests down to lower level devices,
> and it is also use for synchronous IO such as superblock
> and bitmap updates, and for correcting read errors.
>
> This multiple usage can lead to deadlocks. It is likely
> that cloned bios might be queued for write and to be
> waiting for a metadata update before the write can be permitted.
> If the cloning exhausted mddev->bio_set, the metadata update
> may not be able to proceed.
>
> This scenario has been seen during heavy testing, with lots of IO and
> lots of memory pressure.
>
> Address this by adding a new bio_set specifically for synchronous IO.
> All synchronous IO goes directly to the underlying device and is not
> queued at the md level, so request using entries from the new
> mddev->sync_set will complete in a timely fashion.
> Requests that use mddev->bio_set will sometimes need to wait
> for synchronous IO, but will no longer risk deadlocking that iO.
>
> Also: small simplification in mddev_put(): there is no need to
> wait until the spinlock is released before calling bioset_free().
applied, thanks
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> drivers/md/md.c | 27 ++++++++++++++++++++-------
> drivers/md/md.h | 3 +++
> 2 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 87edc342ccb3..9d0eb50c458e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -203,6 +203,14 @@ struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
> }
> EXPORT_SYMBOL_GPL(bio_alloc_mddev);
>
> +static struct bio *md_bio_alloc_sync(struct mddev *mddev)
> +{
> + if (!mddev->sync_set)
> + return bio_alloc(GFP_NOIO, 1);
> +
> + return bio_alloc_bioset(GFP_NOIO, 1, mddev->sync_set);
> +}
> +
> /*
> * We have a system wide 'event count' that is incremented
> * on any 'interesting' event, and readers of /proc/mdstat
> @@ -462,8 +470,6 @@ static void mddev_delayed_delete(struct work_struct *ws);
>
> static void mddev_put(struct mddev *mddev)
> {
> - struct bio_set *bs = NULL;
> -
> if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
> return;
> if (!mddev->raid_disks && list_empty(&mddev->disks) &&
> @@ -471,8 +477,12 @@ static void mddev_put(struct mddev *mddev)
> /* Array is not configured at all, and not held active,
> * so destroy it */
> list_del_init(&mddev->all_mddevs);
> - bs = mddev->bio_set;
> + if (mddev->bio_set)
> + bioset_free(mddev->bio_set);
> + if (mddev->sync_set)
> + bioset_free(mddev->sync_set);
> mddev->bio_set = NULL;
> + mddev->sync_set = NULL;
> if (mddev->gendisk) {
> /* We did a probe so need to clean up. Call
> * queue_work inside the spinlock so that
> @@ -485,8 +495,6 @@ static void mddev_put(struct mddev *mddev)
> kfree(mddev);
> }
> spin_unlock(&all_mddevs_lock);
> - if (bs)
> - bioset_free(bs);
> }
>
> static void md_safemode_timeout(unsigned long data);
> @@ -751,7 +759,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
> if (test_bit(Faulty, &rdev->flags))
> return;
>
> - bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);
> + bio = md_bio_alloc_sync(mddev);
>
> atomic_inc(&rdev->nr_pending);
>
> @@ -783,7 +791,7 @@ int md_super_wait(struct mddev *mddev)
> int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
> struct page *page, int op, int op_flags, bool metadata_op)
> {
> - struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, rdev->mddev);
> + struct bio *bio = md_bio_alloc_sync(rdev->mddev);
> int ret;
>
> bio->bi_bdev = (metadata_op && rdev->meta_bdev) ?
> @@ -5432,6 +5440,11 @@ int md_run(struct mddev *mddev)
> if (!mddev->bio_set)
> return -ENOMEM;
> }
> + if (mddev->sync_set == NULL) {
> + mddev->sync_set = bioset_create(BIO_POOL_SIZE, 0);
> + if (!mddev->sync_set)
> + return -ENOMEM;
> + }
>
> spin_lock(&pers_lock);
> pers = find_pers(mddev->level, mddev->clevel);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 0fa1de42c42b..a1c7aa89bb67 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -444,6 +444,9 @@ struct mddev {
> struct attribute_group *to_remove;
>
> struct bio_set *bio_set;
> + struct bio_set *sync_set; /* for sync operations like
> + * metadata and bitmap writes
> + */
>
> /* Generic flush handling.
> * The last to finish preflush schedules a worker to submit
> --
> 2.12.2
>
prev parent reply other threads:[~2017-06-21 17:49 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-20 23:12 [PATCH] md: use a separate bio_set for synchronous IO NeilBrown
2017-06-21 17:49 ` Shaohua Li [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170621174916.mpxddat2vbiolrik@kernel.org \
--to=shli@kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=neilb@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).