linux-ext4.vger.kernel.org archive mirror
From: tytso@mit.edu
To: Jan Kara <jack@suse.cz>
Cc: Kailas Joshi <kailas.joshi@gmail.com>,
	Jiaying Zhang <jiayingz@google.com>,
	linux-ext4@vger.kernel.org
Subject: Re: Help on Implementation of EXT3 type Ordered Mode in EXT4
Date: Tue, 9 Feb 2010 12:41:45 -0500
Message-ID: <20100209174145.GU4494@thunk.org>
In-Reply-To: <20100209160522.GE15318@atrey.karlin.mff.cuni.cz>

On Tue, Feb 09, 2010 at 05:05:22PM +0100, Jan Kara wrote:
>   Hi,
> 
> > I recently found that in EXT4 with delayed block allocation the Ordered
> > mode does not behave the same as in EXT3.
> > I found a patch for this at http://lwn.net/Articles/324023/, but it has a
> > journal block estimation problem that results in a deadlock.
> > 
> > I would like to know if it has been solved.
> > If not, is it possible to solve it? What are the complexities involved?
>
> It has not been solved. The problem is that to commit data on
> transaction commit (which is what data=ordered mode has historically
> done), you have to allocate space for these blocks. But that
> allocation needs to modify the filesystem and thus journal more
> blocks... And that is tricky - we would have to reserve space in the
> current transaction for allocation of delayed data.  So it gets a
> bit messy...

The dioread_nolock patches from Jiaying, which are currently in the
unstable portion of the tree, are a partial solution to the
data=ordered problem, although they solve it in a slightly different
way.

As a side effect of trying to avoid locking on the direct I/O read
path, the patches change the buffered I/O write path so that the
extent tree is first updated with the blocks allocated under the
"extent uninitialized" bit, and only after the blocks hit the disk do
we, via the bh completion callback, mark the extent as containing
initialized data.

As a result, if you crash before the extent tree is updated, when you
read from the file, you will get all zeros instead of the data, thus
preventing the security leak (exposure of stale block contents).
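
To make the ordering concrete, here is a toy Python model of the
write path described above.  It is only a sketch and not kernel code:
ToyFile, Extent, and all of the method names are made up for
illustration, and the real logic lives in the ext4 extent and I/O
completion code.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4096


@dataclass
class Extent:
    start: int                  # first physical block
    length: int                 # number of blocks
    uninitialized: bool = True  # models the "extent uninitialized" bit


@dataclass
class ToyFile:
    disk: dict = field(default_factory=dict)    # physical block -> bytes
    extents: list = field(default_factory=list)

    def allocate(self, start, length):
        """Step 1: allocate the blocks, but flag the extent uninitialized."""
        ext = Extent(start, length)
        self.extents.append(ext)
        return ext

    def write_data(self, ext, payload):
        """Step 2: the data blocks hit the disk."""
        for i in range(ext.length):
            self.disk[ext.start + i] = payload

    def io_complete(self, ext):
        """Step 3: the completion callback marks the extent initialized."""
        ext.uninitialized = False

    def read(self, ext):
        """An uninitialized extent reads back as zeros, never disk contents."""
        if ext.uninitialized:
            return [bytes(BLOCK_SIZE)] * ext.length
        return [self.disk.get(ext.start + i, bytes(BLOCK_SIZE))
                for i in range(ext.length)]


# Block 100 still holds somebody else's old data when it gets reallocated.
stale = b"S" * BLOCK_SIZE
f = ToyFile(disk={100: stale})
ext = f.allocate(100, 1)
leaked = f.read(ext) == [stale]   # False: a "crash" here exposes only zeros
```

In this model a crash at any point before io_complete() leaves the
extent flagged uninitialized, so a later read returns zeros rather
than the stale contents of block 100.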

It does mean that fsync() is slightly slower, since we now have to
flush the data blocks out, wait for the completion handler to fire and
update the extent in the same jbd2 transaction, and only then wait for
the barrier in the jbd2 transaction.  (In fact, I'm not sure fsync()
is working completely correctly in the current patch in the unstable
patch stream; there may be race conditions where the extent tree
update slips into the next transaction.)  But it does solve the
problem.
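
The fsync() sequence above can be sketched as a toy model.  Again
this is an illustration under assumptions, not the jbd2 API:
ToyJournal and toy_fsync are hypothetical names, and "commit" stands
in for the transaction commit plus barrier.

```python
class ToyJournal:
    """Minimal stand-in for a journal with one running transaction."""

    def __init__(self):
        self.running = []     # updates joined to the running transaction
        self.committed = []   # list of committed transactions

    def add(self, update):
        self.running.append(update)

    def commit(self):
        self.committed.append(self.running)
        self.running = []


def toy_fsync(dirty_extents, journal, log):
    # 1. Flush the data blocks out.
    for ext in dirty_extents:
        log.append(("flush_data", ext))
    # 2. Wait for each completion handler; it joins the extent update
    #    to the transaction that is still running.
    for ext in dirty_extents:
        log.append(("io_complete", ext))
        journal.add(("mark_initialized", ext))
    # 3. Only now commit, so the extent updates are in this transaction.
    log.append(("commit",))
    journal.commit()


journal = ToyJournal()
log = []
toy_fsync(["ext-a", "ext-b"], journal, log)
```

The race mentioned above corresponds to commit() running before one
of the add() calls: that extent update would then sit in
journal.running and only become durable with the next transaction.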

The other downside with this solution is that it only works for files
that are extent-mapped.  If you do this with a converted ext3 file
system that still has files mapped using direct/indirect blocks, then
once you change the mount options to data=writeback,dioread_nolock,
block-allocating writes to these legacy files could result in data
getting exposed after a crash.
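
One way to find such legacy files from user space is to look at the
inode's extents flag (the 'e' that lsattr prints).  The sketch below
assumes a Linux system and the common ioctl number encoding used on
x86 and ARM; the helper names are hypothetical, and only
FS_IOC_GETFLAGS and the EXT4_EXTENTS_FL value (0x00080000) come from
the kernel headers.

```python
import array
import fcntl
import struct

EXT4_EXTENTS_FL = 0x00080000   # inode flag: file is extent-mapped


def fs_ioc_getflags(long_size=struct.calcsize("l")):
    """Build _IOR('f', 1, long) with the common Linux ioctl encoding.

    Assumption: direction bits at 30, size at 16, type at 8 (x86, ARM).
    Some architectures encode the direction bits differently.
    """
    ioc_read = 2
    return (ioc_read << 30) | (long_size << 16) | (ord("f") << 8) | 1


def is_extent_mapped(flags):
    """True if the inode flags word has the extents bit set."""
    return bool(flags & EXT4_EXTENTS_FL)


def file_uses_extents(path):
    """Ask the kernel for the inode flags of `path` and test the bit."""
    # The kernel stores a 32-bit int here despite the 'long' in the macro.
    buf = array.array("i", [0])
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), fs_ioc_getflags(), buf, True)
    return is_extent_mapped(buf[0])
```

Calling file_uses_extents() on each regular file of a converted file
system would list the files still using direct/indirect blocks; the
ioctl fails with OSError on file systems that do not support it.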

Depending on the workload, the upside of using data=writeback instead
of data=ordered could far outweigh the downside of needing to do an
extra block I/O queue flush before the fsync, since it reduces the
number of entangled writes to only the metadata blocks, where
previously the entangled write problem affected metadata blocks plus
all freshly allocated blocks.

Kailas, this is something that I plan to look at in the near future;
if you are interested in helping to benchmark and characterize this
solution, I'd be very interested in working with you.  Can you tell me
a little more about your use case and requirements?

  	      	    	     	      - Ted

