linux-fsdevel.vger.kernel.org archive mirror
From: Jamie Lokier <jamie@shareable.org>
To: Nick Piggin <npiggin@suse.de>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 03:25:20 +0000
Message-ID: <20090121032520.GA2816@shareable.org>
In-Reply-To: <20090121012900.GD24891@wotan.suse.de>

Nick Piggin wrote:
> > For database writes, you typically write a bunch of stuff in various
> > regions of a big file (or multiple files), then ideally fdatasync
> > some/all of the written ranges - with writes committed to disk in the
> > best order determined by the OS and I/O scheduler.
>  
> Do you know which databases do this? It will be nice to ask their
> input and see whether it helps them (I presume it is an OSS database
> because the "big" ones just use direct IO and manage their own
> buffers, right?)

I just found this:

   http://markmail.org/message/injyo7coein7o3xz
   (PostgreSQL)

Tom Lane writes (on org.postgresql.pgsql-hackers):
>Greg Stark <gsst...@mit.edu> writes:
>> Come to think of it I wonder whether there's anything to be gained by
>> using smaller files for tables. Instead of 1G files maybe 256M files
>> or something like that to reduce the hit of fsyncing a file.
>>
>> Actually probably not. The weak part of our current approach is that
>> we tell the kernel "sync this file", then "sync that file", etc, in a
>> more or less random order. This leads to a probably non-optimal
>> sequence of disk accesses to complete a checkpoint. What we would
>> really like is a way to tell the kernel "sync all these files, and let
>> me know when you're done" --- then the kernel and hardware have some
>> shot at scheduling all the writes in an intelligent fashion.
>>
>> sync_file_range() is not that exactly, but since it lets you request
>> syncing and then go back and wait for the syncs later, we could get
>> the desired effect with two passes over the file list. (If the file
>> list is longer than our allowed number of open files, though, the
>> extra opens/closes could hurt.)
>>
>> Smaller files would make the I/O scheduling problem worse not better. 

So if you can make
commit-to-multiple-files-in-optimal-I/O-scheduling-order work, that
would be even better ;-)
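
For reference, the two-pass idea in the quoted message might look
roughly like the sketch below.  It is Linux-specific and only an
illustration: the function name flush_files_two_pass() is made up, the
descriptors are assumed to be open already, and sync_file_range() does
not write file metadata, so on its own it is not a durability guarantee
the way fdatasync() is.

```c
#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Hypothetical helper: flush a set of already-open files in two passes.
 * Returns 0 if every call succeeded, -1 if any call failed.
 */
int flush_files_two_pass(const int *fds, int nfds)
{
    int i, rc = 0;

    /* Pass 1: start writeback on every file, without waiting.
     * offset 0 with nbytes 0 means "from the start to end of file". */
    for (i = 0; i < nfds; i++)
        if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
            rc = -1;

    /* Pass 2: wait for the writeback on each file to finish.  By now
     * the kernel has seen all the requests and can order the I/O. */
    for (i = 0; i < nfds; i++)
        if (sync_file_range(fds[i], 0, 0,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            rc = -1;

    return rc;
}
```

A caller would collect the descriptors of the files dirtied since the
last checkpoint and pass them in one array.  If the list is longer than
the open-file limit, the files have to be reopened for the second pass,
which is the cost mentioned in the quoted message.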

Seems to me the PostgreSQL approach could be improved by issuing parallel
fdatasync() calls, each in its own thread.  Not optimal, exactly, but
more parallelism to schedule around.  (But limited by the I/O request
queue being full with big flushes, so potentially one fdatasync()
starving the others.)
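
A minimal sketch of the threaded variant, assuming POSIX threads and one
thread per descriptor.  The names sync_job, sync_worker() and
fdatasync_parallel() are invented for illustration; a real
implementation would probably cap the number of threads.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* One job per file descriptor (hypothetical structure). */
struct sync_job {
    pthread_t thread;
    int fd;
    int result;   /* return value of fdatasync() */
    int started;  /* nonzero if the thread was created */
};

static void *sync_worker(void *arg)
{
    struct sync_job *job = arg;

    job->result = fdatasync(job->fd);
    return NULL;
}

/*
 * Issue fdatasync() on every descriptor concurrently, then wait for
 * all of them.  Returns 0 if every flush succeeded, -1 otherwise.
 */
int fdatasync_parallel(const int *fds, int nfds)
{
    struct sync_job *jobs;
    int i, rc = 0;

    if (nfds <= 0)
        return 0;

    jobs = calloc((size_t)nfds, sizeof(*jobs));
    if (jobs == NULL)
        return -1;

    /* Start all the flushes so the I/O scheduler sees them together. */
    for (i = 0; i < nfds; i++) {
        jobs[i].fd = fds[i];
        if (pthread_create(&jobs[i].thread, NULL, sync_worker, &jobs[i]) == 0)
            jobs[i].started = 1;
        else
            rc = -1;
    }

    /* Wait for each flush and collect its result. */
    for (i = 0; i < nfds; i++) {
        if (!jobs[i].started)
            continue;
        pthread_join(jobs[i].thread, NULL);
        if (jobs[i].result != 0)
            rc = -1;
    }

    free(jobs);
    return rc;
}
```

This only gives the kernel more requests in flight at once; it does not
address the request queue filling up with one big flush.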

-- Jamie
