Re: [rfc] fsync_range?

From: Nick Piggin <npiggin@suse.de>
To: Jamie Lokier <jamie@shareable.org>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 04:48:35 +0100
Message-ID: <20090121034835.GG24891@wotan.suse.de>
In-Reply-To: <20090121031500.GA2354@shareable.org>

On Wed, Jan 21, 2009 at 03:15:00AM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > > I like the idea.  It's much easier to understand than sync_file_range,
> > > whose man page doesn't really explain how to use it correctly.
> > > 
> > > But how is fsync_range different from the sync_file_range syscall with
> > > all its flags set?
> > 
> > sync_file_range would have to wait, then write, then wait. It also
> > does not call into the filesystem's ->fsync function, I don't know
> > what the wider consequences of that are for all filesystems, but
> > for some it means that metadata required to read back the data is
> > not synced properly, and often it means that metadata sync will not
> > work.
> 
> fsync_range() must also wait, write, then wait again.
> 
> The reason is this sequence of events:
> 
>     1. App calls write() on a page, dirtying it.
>     2. Data writeout is initiated by usual kernel task.
>     3. App calls write() on the page again, dirtying it again.
>     4. App calls fsync_range() on the page.
>     5. ... Dum de dum, time passes ...
>     6. Writeout from step 2 completes.
> 
>     7. fsync_range() initiates another writeout, because the
>        in-progress writeout from step 2 might not include the changes from
>        step 3.
> 
>     8. fsync_range() waits for the writeout from step 7.
>     9. fsync_range() requests a device cache flush if needed (we hope!).
>    10. Returns to app.
> 
> Therefore fsync_range() must wait for in-progress writeout to
> complete, before initiating more writeout and waiting again.

That's only in rare cases where writeout is started but not completed
before we last dirty the page and before we call the next fsync. I'd say
in most cases we won't have to wait (the page should often remain clean).
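
A toy model of the sequence quoted above may make both points concrete.
It is a sketch only: struct toy_page and the toy_* functions are
hypothetical names invented for illustration, not kernel code, and
writeout completion is simulated synchronously. It shows that the first
wait happens only when a writeout is already in flight, and that
otherwise a single write plus wait is enough.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one pagecache page; every name here is hypothetical. */
struct toy_page {
    int  mem_version;   /* contents currently in memory */
    int  inflight;      /* contents captured by the in-progress writeout */
    int  disk_version;  /* contents that have reached the disk */
    bool dirty;         /* memory is newer than what was last submitted */
    bool writeback;     /* a writeout is in progress */
};

/* App write(): new contents arrive and the page becomes dirty. */
void toy_write(struct toy_page *p, int version)
{
    p->mem_version = version;
    p->dirty = true;
}

/* Start writeout: snapshot the current contents and clear dirty. */
void toy_start_writeout(struct toy_page *p)
{
    assert(!p->writeback);  /* only one writeout in flight per page */
    p->inflight = p->mem_version;
    p->dirty = false;
    p->writeback = true;
}

/* Writeout completes: the snapshot is now on disk. */
void toy_end_writeout(struct toy_page *p)
{
    p->disk_version = p->inflight;
    p->writeback = false;
}

/* Wait, write, wait. Returns the number of waits actually performed. */
int toy_fsync_range(struct toy_page *p)
{
    int waits = 0;

    if (p->writeback) {     /* first wait: only if writeout is in flight */
        toy_end_writeout(p);
        waits++;
    }
    if (p->dirty) {         /* write the latest contents, then wait */
        toy_start_writeout(p);
        toy_end_writeout(p);
        waits++;
    }
    return waits;
}
```

Replaying steps 1 to 4 (write, start writeout, write again, sync) gives
two waits in this model; a page that was dirtied but never put under
writeout gives one.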

 
> This is the reason sync_file_range() has all those flags.  As I said,
> the man page doesn't really explain how to use it properly.

Well, one can read what the code does. Aside from that extra wait,
and the problem of not syncing metadata, one thing I dislike about
it is that it exposes the new concept of "writeout" to the userspace
ABI.  Previously all we cared about was whether something is safe
on disk or not. So I think it is reasonable to augment the traditional
data integrity APIs, which will probably be easier for existing apps
to use.
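
For reference, here is a rough sketch of what an application can do
today with the existing calls. The helper name range_sync_sketch is
hypothetical; sync_file_range() and its three flags are the Linux
interface discussed above, and the trailing fdatasync() is an assumption
added here to cover the metadata that sync_file_range() does not write.
fdatasync() acts on the whole file, so this is a conservative stand-in
rather than a true range sync.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: approximate a range sync with existing calls.
 * Returns 0 on success, -1 with errno set on failure.
 */
int range_sync_sketch(int fd, off_t offset, off_t nbytes)
{
    /* Wait for in-flight writeout, start new writeout, wait for it. */
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER;

    assert(offset >= 0 && nbytes >= 0);

    if (sync_file_range(fd, offset, nbytes, flags) != 0)
        return -1;

    /*
     * sync_file_range() writes no metadata, so fall back to
     * fdatasync() for the metadata needed to read the data back.
     * This flushes any other dirty data in the file as well.
     */
    return fdatasync(fd);
}
```

The caller sees only "safe on disk or not"; the writeout flags stay
inside the helper.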
 

> An optimisation would be to detect I/O that's been queued on an
> elevator, but where the page has not actually been read (i.e. no DMA
> or bounce buffer copy done yet).  Most queued I/O presumably falls
> into this category, and the second writeout would not be required.
> 
> But perhaps this doesn't happen much in real life?

I doubt it would be worth the complexity. It would probably be a pretty
fiddly and ugly change to the pagecache.


> Also the kernel is in a better position to decide which order to do
> everything in, and how best to batch it.

Better position than what? I proposed fsync_range (or fsyncv) to be
in-kernel too, of course.

 
> Also, during the first wait (for in-progress writeout) the kernel
> could skip ahead to queuing some of the other pages for writeout as
> long as there is room in the request queue, and come back to the other
> pages later.

Sure it could. That adds yet more complexity and opens up the
possibility of livelock (you go back to the page you were waiting for
and find it has since been redirtied and is under writeout again).


> > > For database writes, you typically write a bunch of stuff in various
> > > regions of a big file (or multiple files), then ideally fdatasync
> > > some/all of the written ranges - with writes committed to disk in the
> > > best order determined by the OS and I/O scheduler.
> >  
> > Do you know which databases do this? It would be nice to ask for their
> > input and see whether it helps them (I presume it is an OSS database
> > because the "big" ones just use direct IO and manage their own
> > buffers, right?)
> 
> I don't know if anyone uses sync_file_range(), or if it even works
> reliably, since it's not going to get much testing.

The problem is that it is hard to verify. Even if it is getting lots
of testing, it is not getting enough testing with the block device
being shut off or throwing errors at exactly the right time.

In 2.6.29 I just fixed a handful of data integrity and error reporting
bugs in sync that have been there for basically all of 2.6.

 
> I don't use it myself yet.  My interest is in developing (yet
> another?)  high performance but reliable database engine, not an SQL
> one though.  That's why I keep noticing the issues with fsync,
> sync_file_range, barriers etc.
> 
> Take a look at this, though:
> 
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-04/msg00811.html
> 
> "The results show fadvise + sync_file_range is on par or better than
> O_DIRECT. Detailed results are attached."

That's not to say fsync would be any worse. And it's just a microbenchmark
anyway.

 
> By the way, direct I/O is nice but (a) not always possible, and (b)
> you don't get the integrity barriers, do you?

It should. But I wasn't advocating it versus pagecache + syncing,
just wondering what databases could use fsyncv so we can see if
they can test.

 
> > Today, they will have to just fsync the whole file. So they first must
> > identify which parts of the file need syncing, and then gather those
> > parts as a vector.
> 
> Having to fsync the whole file is one reason that some databases use
> separate journal files - so fsync only flushes the journal file, not
> the big data file which can sometimes be more relaxed.
> 
> It's also a reason some databases recommend splitting the database
> into multiple files of limited size - so the hit from fsync is reduced.
> 
> When a single file is used for journal and data (like
> e.g. ext3-in-a-file), every transaction (actually coalesced set of
> transactions) forces the disk head back and forth between two data
> areas.  If the journal can be synced by itself, the disk head doesn't
> need to move back and forth as much.
> 
> Identifying which parts to sync isn't much different from what a modern
> filesystem needs to do with its barriers, journals and journal-trees.
> They have a lot in common.  This is bread and butter stuff for
> database engines.
> 
> fsync_range would remove those reasons for using separate files,
> making the database-in-a-single-file implementations more efficient.
> That is administratively much nicer, imho.
> 
> Similar for userspace filesystem-in-a-file, which is basically the same.

Although I think a large part is IOPS rather than data throughput,
so the cost of fsync_range often might not be much better.
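
Gathering the parts as a vector could look roughly like the following.
struct dirty_range and coalesce_ranges are hypothetical names, and
merging overlapping or adjacent ranges is just one assumed policy for
cutting down the number of separate requests; nothing in the proposal
requires it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical description of one dirty byte range: [start, start + len). */
struct dirty_range {
    off_t start;
    off_t len;
};

/* qsort() comparator: order ranges by starting offset. */
static int cmp_start(const void *a, const void *b)
{
    const struct dirty_range *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/*
 * Sort the vector by offset and merge overlapping or adjacent ranges
 * in place. Returns the number of ranges left at the front of v.
 */
size_t coalesce_ranges(struct dirty_range *v, size_t n)
{
    size_t out = 0;

    assert(v != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(v, n, sizeof(*v), cmp_start);
    for (size_t i = 1; i < n; i++) {
        off_t end = v[out].start + v[out].len;

        if (v[i].start <= end) {
            /* Overlaps or touches the current range: extend it. */
            off_t iend = v[i].start + v[i].len;
            if (iend > end)
                v[out].len = iend - v[out].start;
        } else {
            /* Gap: begin a new output range. */
            v[++out] = v[i];
        }
    }
    return out + 1;
}
```

Fewer, larger ranges mean fewer requests, which matters if the cost is
dominated by the number of I/Os rather than by bytes written.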


> > > For this, taking a vector of multiple ranges would be nice.
> > > Alternatively, issuing parallel fsync_range calls from multiple
> > > threads would approximate the same thing - if (big if) they aren't
> > > serialised by the kernel.
> > 
> > I was thinking about doing something like that, but I just wanted to
> > get basic fsync_range... OTOH, we could do an fsyncv syscall and glibc
> > could implement fsync_range on top of that?
> 
> Rather than fsyncv, is there some way to separate the fsync into parts?
> 
>    1. A sequence of system calls to designate ranges.
>    2. A call to say "commit and wait on all those ranges given in step 1".

What's the problem with fsyncv? The problem with your proposal is that
it takes multiple syscalls and that it requires the kernel to build up
state across syscalls, which is nasty.
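
To illustrate the single-call shape, here is a userspace approximation.
struct fsyncv_range and fsyncv_sketch are hypothetical, since no fsyncv
ABI has been defined in this thread; the body uses only existing calls.
The first pass is merely a hint to start writeout early, and durability
in this sketch comes from the final fdatasync(), which covers the whole
file rather than just the listed ranges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical vector element; not an existing kernel ABI. */
struct fsyncv_range {
    off_t offset;
    off_t nbytes;
};

/*
 * Hypothetical stand-in for an fsyncv-style call. The caller passes
 * every range at once, so nothing is remembered between calls.
 * Returns 0 on success, -1 with errno set on failure.
 */
int fsyncv_sketch(int fd, const struct fsyncv_range *v, size_t n)
{
    assert(v != NULL || n == 0);

    /* Pass 1: ask for writeout of every range without waiting. */
    for (size_t i = 0; i < n; i++) {
        if (sync_file_range(fd, v[i].offset, v[i].nbytes,
                            SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
    }

    /* Pass 2: one wait for the data and the metadata needed to read it. */
    return fdatasync(fd);
}
```

All ranges arrive in one call, so the ordering and batching decisions
can be made with the full set in view.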
