From: Jamie Lokier <jamie@shareable.org>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Theodore Tso <tytso@MIT.EDU>
Subject: Re: [RFC] [PATCH] vfs: Call filesystem callback when backing device caches should be flushed
Date: Wed, 21 Jan 2009 21:41:32 +0000
Message-ID: <20090121214132.GD16133@shareable.org>
In-Reply-To: <20090121150557.GC3186@duck.suse.cz>
Jan Kara wrote:
> Well, that would be nice but you cannot return from fsync() until you've
> done the flush. So you have to be careful not to wait for too long. JBD
> actually plays these tricks with sync transaction batching and it's not
> trivial to get this right. So I'd rather avoid it.
Didn't extN for some N do/did something similar?
> > What about O_SYNC writes though? A device flush after each one would
> > be expensive, but that's what equivalence to fsync() implies is
> > needed.
> Yes.
>
> > O_DIRECT writes shouldn't do block_flush_device(), but an app may
> > still need a way to commit data for integrity. So fsync() or
> > fdatasync() called after a series of O_DIRECT writes should call
> > block_flush_device() _even_ though there's no page-cache dirty data to
> > commit, and even if there's no inode change to commit.
> Hmm, this is an interesting point. You're right that we currently miss
> the flushes and we probably need some dirty inode flag like needs_flush or
> so.
Proposal (both together):
  1. per-device-queue flag needs_flush.
     Set on write queued, clear on flush queued. When clear, flushes
     are discarded instead of being queued. Waiting on the discarded
     flush waits instead for the last flush which was queued, if it's
     still in flight. So the queue will also track that last flush.
  2. per-inode flag needs_flush.
     Set on write queued from this file (writeback), cleared on flush
     sent from this file (i.e. the thing fsync/fdatasync/O_SYNC should
     be calling). As above, flushes aren't sent from this file when
     this flag is clear, and waiting on a discarded flush waits
     instead on the last flush sent for this file, if it's still in
     flight. So the file will track that last flush command in
     addition to needs_flush.
Implement both. The first does the right thing, optimising away
unnecessary journal/tree-log barriers. The second further optimises
individual files.
You *could* have a needs_flush bit per page, to tune it further, in
the same way that fsync_range() and O_DIRECT invalidations etc. are
getting better at working with ranges, but that may be pointless
overengineering (I've no idea).
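To make the two needs_flush flags proposed above concrete, here is a
rough userspace model in C. It is only a sketch, not kernel code: every
name in it (struct flush_scope, struct file_model, file_write,
file_fsync, ...) is made up for illustration, a flush is modelled as a
plain integer id, and "waiting" is reduced to returning the id of the
flush the caller would have to wait on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one flush scope: a device queue or an inode. */
struct flush_scope {
    bool needs_flush;             /* set on write queued, cleared on flush queued */
    unsigned long last_flush;     /* id of the last flush queued here, 0 if none */
    unsigned long flushes_queued; /* how many flushes were really queued */
};

void scope_init(struct flush_scope *s)
{
    assert(s != NULL);
    s->needs_flush = false;
    s->last_flush = 0;
    s->flushes_queued = 0;
}

/* Device queue: queue a flush only if a write arrived since the last one.
 * Either way, return the id of the flush the caller should wait on. */
unsigned long queue_request_flush(struct flush_scope *q)
{
    if (q->needs_flush) {
        q->needs_flush = false;
        q->last_flush = ++q->flushes_queued;
    }
    /* Flag clear: the request is discarded and the last flush stands in. */
    return q->last_flush;
}

/* Hypothetical file: its own flag plus a pointer to the shared queue. */
struct file_model {
    struct flush_scope inode;
    struct flush_scope *queue;
};

void file_init(struct file_model *f, struct flush_scope *q)
{
    scope_init(&f->inode);
    f->queue = q;
}

/* A write from this file was queued: mark both the inode and the queue. */
void file_write(struct file_model *f)
{
    f->inode.needs_flush = true;
    f->queue->needs_flush = true;
}

/* What fsync/fdatasync/O_SYNC would call in this model. */
unsigned long file_fsync(struct file_model *f)
{
    if (f->inode.needs_flush) {
        f->inode.needs_flush = false;
        f->inode.flushes_queued++;
        /* The queue may still discard this if another file flushed already. */
        f->inode.last_flush = queue_request_flush(f->queue);
    }
    /* Flag clear: nothing is sent; wait on this file's last flush instead. */
    return f->inode.last_flush;
}
```

In this model, if files A and B both write and then both call
file_fsync(), only the first call queues a device flush. The second is
discarded at the queue level and hands back the id of the first, because
no write was queued in between. A real implementation would also need
locking and in-flight tracking, which this sketch leaves out entirely.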
> > Since you want to avoid issuing two device flushes in a row (they're
> > not free), and a journalling fs may issue one separately, as Joel says
> > a filesystem could override this.
> Yes, journalling filesystems usually take care themselves.
>
> > But I suspect it would be better to keep the generic call to
> > block_flush_device() from fsync(), and at the block layer discard
> > duplicate flushes that have no writes in between.
> Hmm, probably this won't be too hard to implement. OTOH it won't catch
> those cases where some other process manages to squeeze in some writes
> between the two flushes. So I'm not sure if we really want to design things
> this way unless really necessary.
Let me put it this way. ext3 is a journalling fs, and it does _not_
provide integrity with fsync() or fdatasync() in all cases, even with
barriers and data=ordered turned on.
We should have something which provides flushes generically, with the
possibility for the fs to override it with a smarter method when it
knows better.
-- Jamie