From: "Darrick J. Wong" <djwong@us.ibm.com>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>, "Theodore Ts'o" <tytso@mit.edu>,
Neil Brown <neilb@suse.de>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Jan Kara <jack@suse.cz>, Mike Snitzer <snitzer@redhat.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
Keith Mannthey <kmannth@us.ibm.com>,
Mingming Cao <cmm@us.ibm.com>,
linux-ext4@vger.kernel.org, Ric Wheeler <rwheeler@redhat.com>,
Christoph Hellwig <hch@lst.de>, Josef Bacik <josef@redhat.com>
Subject: Re: [PATCH v7.1] block: Coordinate flush requests
Date: Thu, 13 Jan 2011 10:59:46 -0800
Message-ID: <20110113185946.GD27381@tux1.beaverton.ibm.com>
In-Reply-To: <20110113104240.GA30719@htj.dyndns.org>

On Thu, Jan 13, 2011 at 11:42:40AM +0100, Tejun Heo wrote:
> Hello, Darrick.
>
> On Wed, Jan 12, 2011 at 06:56:46PM -0800, Darrick J. Wong wrote:
> > On certain types of storage hardware, flushing the write cache takes a
> > considerable amount of time. Typically, these are simple storage systems with
> > write cache enabled and no battery to save that cache during a power failure.
> > When we encounter a system with many I/O threads that try to flush the cache,
> > performance is suboptimal because each of those threads issues its own flush
> > command to the drive instead of trying to coordinate the flushes, thereby
> > wasting execution time.
> >
> > Instead of each thread initiating its own flush, we now try to detect the
> > situation where multiple threads are issuing flush requests. The first thread
> > to enter blkdev_issue_flush becomes the owner of the flush, and all threads
> > that enter blkdev_issue_flush before the flush finishes are queued up to wait
> > for the next flush. When that first flush finishes, one of those sleeping
> > threads is woken up to perform the next flush and then wake up the other
> > threads which are asleep waiting for the second flush to finish.
>
> Nice work. :-)

Thank you!

> >  block/blk-flush.c     |  137 +++++++++++++++++++++++++++++++++++++++++--------
> >  block/genhd.c         |   12 ++++
> >  include/linux/genhd.h |   15 +++++
> >  3 files changed, 143 insertions(+), 21 deletions(-)
> >
> > diff --git a/block/blk-flush.c b/block/blk-flush.c
> > index 2402a34..d6c9931 100644
> > --- a/block/blk-flush.c
> > +++ b/block/blk-flush.c
> > @@ -201,6 +201,60 @@ static void bio_end_flush(struct bio *bio, int err)
> > bio_put(bio);
> > }
> >
> > +static int blkdev_issue_flush_now(struct block_device *bdev,
> > + gfp_t gfp_mask, sector_t *error_sector)
> > +{
> > + DECLARE_COMPLETION_ONSTACK(wait);
> > + struct bio *bio;
> > + int ret = 0;
> > +
> > + bio = bio_alloc(gfp_mask, 0);
> > + bio->bi_end_io = bio_end_flush;
> > + bio->bi_bdev = bdev;
> > + bio->bi_private = &wait;
> > +
> > + bio_get(bio);
> > + submit_bio(WRITE_FLUSH, bio);
> > + wait_for_completion(&wait);
> > +
> > + /*
> > + * The driver must store the error location in ->bi_sector, if
> > + * it supports it. For non-stacked drivers, this should be
> > + * copied from blk_rq_pos(rq).
> > + */
> > + if (error_sector)
> > + *error_sector = bio->bi_sector;
> > +
> > + if (!bio_flagged(bio, BIO_UPTODATE))
> > + ret = -EIO;
> > +
> > + bio_put(bio);
> > + return ret;
> > +}
>
> But wouldn't it be better to implement this directly in the flush
> machinery instead of as a blkdev_issue_flush() wrapper?  We have all the
> information at the request queue level, so we can easily detect whether
> flushes can be merged or not, and whether something is issued by
> blkdev_issue_flush() or by directly submitting a bio wouldn't matter at
> all.  Am I missing something?

Could you clarify what exactly you meant by "in the flush machinery"?  I think
what you're suggesting is that blk_flush_complete_seq could scan the pending
flush list to find a run of consecutive pure flush requests, submit only the
last one, and arrange for every request in that run to be completed with the
real flush's return code when it finishes.
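
For concreteness, a rough and purely illustrative sketch of that kind of
merging might look like the following.  None of this is real blk-flush.c code;
the structures and helpers (fake_request, issue_one_flush, and so on) are made
up for the example:

/* Toy stand-in for a queued request; not the real struct request. */
struct fake_request {
	struct fake_request *next;
	int is_pure_flush;	/* a flush with no data payload */
	void (*complete)(struct fake_request *rq, int error);
};

/* Stand-in for actually sending one flush command to the device. */
static int issue_one_flush(struct fake_request *rq)
{
	(void)rq;
	return 0;
}

/*
 * Walk a pending list, collect each run of consecutive pure flushes,
 * issue a single flush on behalf of the whole run, and complete every
 * request in the run with that one flush's result.
 */
static void merge_and_issue_flushes(struct fake_request *pending)
{
	while (pending) {
		if (!pending->is_pure_flush) {
			pending = pending->next;
			continue;
		}

		/* Find the end of this run of consecutive pure flushes. */
		struct fake_request *first = pending;
		struct fake_request *last = pending;
		while (last->next && last->next->is_pure_flush)
			last = last->next;

		/* One flush covers the whole run. */
		int error = issue_one_flush(last);

		/* Everyone in the run sees the real flush's return code. */
		struct fake_request *stop = last->next;
		for (struct fake_request *rq = first; rq != stop; ) {
			struct fake_request *next = rq->next;
			rq->complete(rq, error);
			rq = next;
		}

		pending = stop;
	}
}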

If that's what you meant, then yes, it could be done that way.  However, I have
a few reasons for choosing the blkdev_issue_flush approach.  First, as far as I
can tell, all the code paths in the kernel where upper layers issue pure
flushes go through blkdev_issue_flush, so it is a convenient choke point for
capturing all the incoming pure flushes.  Other parts of the kernel issue
writes with the flush flag set, but we want those to go through the regular
queuing mechanisms anyway.  Second, with the proposed patch, any upper-layer
code that needs a pure flush issued without coordination can still get one by
submitting the bio directly (or blkdev_issue_flush_now can be exported).
Third, baking the coordination into the flush machinery makes that machinery
more complicated to understand and maintain.

In short, I went with the blkdev_issue_flush approach because the code is
easier to understand, and it doesn't lose any pure flushes despite casting a
narrower net.
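
To make the owner/waiter handoff from the patch description concrete, here is
a minimal userspace sketch of the same coordination idea.  This is an
illustration only, not the patch itself: do_flush() stands in for submitting
the flush bio, plain pthreads stand in for the block layer's locking, and the
real patch keeps the equivalent state per-disk (hence the genhd changes in the
diffstat) rather than in globals:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static bool flush_in_progress;
static unsigned long long flush_generation;	/* bumped when a flush finishes */
static int last_result;			/* result of the most recent flush */

/* Stand-in for actually issuing the cache flush to the device. */
static int do_flush(void)
{
	return 0;
}

int coordinated_flush(void)
{
	int ret;

	pthread_mutex_lock(&flush_lock);
	if (flush_in_progress) {
		/*
		 * A flush is already in flight, but it may not cover writes
		 * that completed just before we got here.  Only a flush that
		 * starts after this point is guaranteed to cover us, and the
		 * earliest such flush finishes two generations from now.
		 */
		unsigned long long covered = flush_generation + 2;

		while (flush_generation < covered && flush_in_progress)
			pthread_cond_wait(&flush_done, &flush_lock);

		if (flush_generation >= covered) {
			/* A later flush, issued by someone else, covered us. */
			ret = last_result;
			pthread_mutex_unlock(&flush_lock);
			return ret;
		}
		/* The in-flight flush finished; we won ownership of the next one. */
	}

	/* We are the owner of the next flush. */
	flush_in_progress = true;
	pthread_mutex_unlock(&flush_lock);

	ret = do_flush();

	pthread_mutex_lock(&flush_lock);
	flush_in_progress = false;
	flush_generation++;
	last_result = ret;
	/* Wake all waiters; one of them becomes the next owner. */
	pthread_cond_broadcast(&flush_done);
	pthread_mutex_unlock(&flush_lock);

	return ret;
}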

I was also wondering: how many others are testing these flush-pain-reduction
patches?  It would be nice to know about wider testing than just my lab. :)
--D