Re: [rfc] fsync_range? - Jamie Lokier

From: Jamie Lokier <jamie@shareable.org>
To: Bryan Henderson <hbryan@us.ibm.com>
Cc: linux-fsdevel@vger.kernel.org, Nick Piggin <npiggin@suse.de>
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 22:30:03 +0000
Message-ID: <20090121223002.GJ16133@shareable.org>
In-Reply-To: <OF189BAEF9.2F41F65A-ON88257545.0078639B-88257545.007A2F05@us.ibm.com>

Bryan Henderson wrote:
> > If you have 100 file regions, each one a few pages in size, and you do
> > 100 fsync_range() calls, that results in potentially far from optimal
> > I/O scheduling (e.g. all over the disk) *and* 100 low-level disk cache
> > flushes (I/O barriers) instead of just one at the end.  100 head seeks
> > and 100 cache flush ops can be very expensive.
> 
> You got lost in the thread here.  I proposed a fadvise() that would result 
> in I/O scheduling; Nick said the fadvise() might have to block; I said so 
> what?  Now you seem to be talking about 100 fsync_range() calls, each of 
> which starts and then waits for a sync of one range.
>
> Getting back to I/O scheduled as a result of an fadvise(): if it blocks 
> because the block queue is full, then it's going to block with a 
> multi-range fsync_range() as well.

No, why would it block?  The block queue has room for (say) 100 small
file ranges.  If you submit 1000 ranges, sure, the first 900 may block;
then you've got 100 left in the queue.

Then you call fsync_range() 1000 times.  The first 900 are NOPs, as you
say, because the data has already been written.  The remaining 100 (the
size of the block queue) are forced to write serially.  They're even
written to the disk platter in order.
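
As a rough illustration of why strictly in-order writes can cost more
than reordered ones, the following toy model in C sums the head travel
for a set of block positions serviced first in submission order and then
in sorted order.  The positions and the names head_travel() and
head_travel_sorted() are made up for this sketch.  It ignores rotational
latency, the disk's own queue, and whatever a real elevator actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: total head travel when positions are visited in the given
 * order, starting from `start`.  Units are arbitrary block numbers. */
static long long head_travel(long long start, const long long *pos, size_t n)
{
    long long total = 0;
    long long at = start;

    assert(pos != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        long long d = pos[i] - at;
        total += d < 0 ? -d : d;
        at = pos[i];
    }
    return total;
}

/* Comparison callback for qsort(): ascending order, no overflow. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Same positions, but visited in ascending order, as a simple one-way
 * sweep might do.  The caller's array is left untouched.
 * Returns -1 if memory cannot be allocated. */
static long long head_travel_sorted(long long start, const long long *pos,
                                    size_t n)
{
    long long *copy;
    long long total;

    if (n == 0)
        return 0;
    copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy, pos, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_ll);
    total = head_travel(start, copy, n);
    free(copy);
    return total;
}
```

With the hypothetical positions 90, 10, 80, 20 and the head starting at
0, head_travel() gives 300 and head_travel_sorted() gives 90.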

> My fadvise-based proposal waits for I/O only after it has all been 
> submitted.

Are you saying one call to fsync_range() should wait for all the
writes which have been queued by the fadvise() to different ranges?

> But plugging (delaying the start of I/O even though it is ready to go and 
> the device is idle) is rarely a good idea.  It can help for short bursts 
> to a mostly idle device (typically saves half a seek per burst), but a 
> busy device provides a natural plug.  It thus can't help throughput, but 
> can improve the response time of a burst.

I agree, plugging doesn't make a big difference.

However, letting the disk or elevator reorder the writes it has room
for does sometimes make a big difference.  That's the point.  We're
not talking about forcibly _delaying_ I/O, we're talking about giving
the block elevator, and the disk's own elevator, freedom to do their job
by not forcibly _flushing_ and _waiting_ between each individual
request for the length of the queue.
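
A minimal sketch of the "submit everything, then wait once" pattern,
using calls that already exist on Linux rather than the proposed
fsync_range(): sync_file_range(2) with SYNC_FILE_RANGE_WRITE to start
writeback of each range, followed by a single fdatasync(2).  The struct
byte_range type and the sync_ranges_batched() name are hypothetical.
Note the assumptions: sync_file_range() on its own gives no durability
guarantee, fdatasync() covers the whole file and not only the listed
ranges, and whether the final step turns into a single disk cache flush
depends on the filesystem and the device.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: one dirty byte range within a file. */
struct byte_range {
    off_t offset;
    off_t length;
};

/*
 * Start writeback for every range first, then wait once at the end.
 * Returns 0 on success, -1 with errno set if the final fdatasync() fails.
 */
static int sync_ranges_batched(int fd, const struct byte_range *r, size_t n)
{
    assert(r != NULL || n == 0);

    /*
     * Pass 1: ask the kernel to begin writing each range, without
     * waiting for completion.  This is treated as a hint only, so its
     * return value is deliberately ignored; pass 2 does the real work
     * and reports errors.  The call may still block if the request
     * queue is full.
     */
    for (size_t i = 0; i < n; i++)
        (void)sync_file_range(fd, r[i].offset, r[i].length,
                              SYNC_FILE_RANGE_WRITE);

    /*
     * Pass 2: one wait for the file's data.  Between pass 1 and this
     * point the block layer and the disk are free to reorder the
     * queued writes.
     */
    return fdatasync(fd);
}
```

A caller would fill an array of struct byte_range with the regions it
has just written and pass it in with the open file descriptor.  The
contrasting pattern, one flush-and-wait per range inside the loop, is
what removes the elevator's freedom to reorder.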

-- Jamie
