linux-fsdevel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Theodore Tso <tytso@mit.edu>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Daniel Phillips <phillips@phunq.net>,
	linux-fsdevel@vger.kernel.org, tux3@tux3.org,
	Andrew Morton <akpm@linux-fou
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available
Date: Mon, 16 Mar 2009 16:12:11 +1100
Message-ID: <20090316051211.GB26138@disturbed>
In-Reply-To: <20090315214426.GA6357@mit.edu>

On Sun, Mar 15, 2009 at 05:44:26PM -0400, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata
> > > block (i.e., what is currently called buffer cache) at the time it is
> > > dirtied.  I don't know if this is the best thing to do, but it is
> > > interesting that you do the same thing.  I also don't know if I want to
> > > trust a library to get this right, before having completely proved out
> > > the idea in a non-trivial filesystem.  But good luck with that!  It
> > 
> > I'm not sure why it would be a big problem. fsblock isn't allocating
> > the block itself of course, it just asks the filesystem to. It's
> > trivial to do for fsblock.
> 
> So the really unfortunate thing about allocating the block as soon as
> the page is dirty is that it spikes out delayed allocation.  By
> delaying the physical allocation of the logical->physical mapping as
> long as possible, the filesystem can select the best possible physical
> location.

This is no different to the way delayed allocation with bufferheads
works. Both XFS and ext4 set the buffer_delay flag instead of
allocating up front so that later on in ->writepages we can do
optimal delayed allocation. AFAICT fsblock works the same way....
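
A minimal sketch of that pattern, not the actual bufferhead or fsblock
code: dirtying a block only marks it as delayed, and the physical
location is chosen for all delayed blocks at once at writeback time.
The names ToyBlock, ToyFile and first_free are hypothetical; the `delay`
field only stands in for the buffer_delay flag mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyBlock:
    logical: int
    delay: bool = True               # dirty, space reserved, no physical block yet
    physical: Optional[int] = None   # filled in only at writeback time


class ToyFile:
    def __init__(self) -> None:
        self.blocks: Dict[int, ToyBlock] = {}

    def write(self, logical: int) -> None:
        # Dirtying a block only marks it delayed; nothing is allocated here.
        self.blocks.setdefault(logical, ToyBlock(logical))

    def writepages(self, first_free: int) -> int:
        # Allocate all delayed blocks in one pass, in logical order, so they
        # land in one physical run starting at first_free (a hypothetical
        # stand-in for whatever the allocator would pick).
        pending = sorted((b for b in self.blocks.values() if b.delay),
                         key=lambda b: b.logical)
        for offset, block in enumerate(pending):
            block.physical = first_free + offset
            block.delay = False
        return first_free + len(pending)   # next free physical block
```

Because the allocator sees the whole dirty range before deciding, blocks
written out of order can still end up physically contiguous.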

> XFS, for example, keeps a btree of free regions indexed by
> size so that it can select the perfect location for a newly written
> file which is 24k or 56k long.

Ah, no. It's far more complex than that. To begin with, XFS has
*two* freespace trees per allocation group - one indexed by extent size,
the other by extent starting block.

XFS looks for an exact or nearby extent start block match that is
big enough in the by-block tree. If it can't find a nearby match,
then it looks up a size match in the by-size tree. i.e. the
fundamental allocation assumption is that locality of data placement
matters far more than filling holes in the freespace trees.....
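
The lookup order described above can be sketched roughly as follows.
This is an illustration only, not the XFS allocator: the two sorted
lists stand in for the by-block and by-size trees of one allocation
group, and the `nearby` window and the function name are assumptions
made for the example.

```python
import bisect
from typing import List, Optional, Tuple

Extent = Tuple[int, int]   # (start block, size in blocks)


def toy_allocate(free: List[Extent], target: int, length: int,
                 nearby: int = 16) -> Optional[Extent]:
    by_block = sorted(free)                               # keyed by start block
    by_size = sorted(free, key=lambda e: (e[1], e[0]))    # keyed by extent size

    # 1. Locality first: any big-enough extent starting near the target.
    lo = bisect.bisect_left(by_block, (target - nearby, 0))
    hi = bisect.bisect_right(by_block, (target + nearby, float("inf")))
    near = [e for e in by_block[lo:hi] if e[1] >= length]
    if near:
        return min(near, key=lambda e: abs(e[0] - target))

    # 2. No nearby match: fall back to the smallest extent that fits.
    sizes = [e[1] for e in by_size]
    i = bisect.bisect_left(sizes, length)
    return by_size[i] if i < len(by_size) else None
```

With free extents (10, 4), (100, 32) and (500, 8), a 6-block request
targeted at block 98 takes the extent at 100 even though the one at 500
is a tighter fit; only a request with no nearby candidate falls through
to the size lookup.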

> In addition, XFS uses delayed allocation to avoid the problem of
> uninitialized data becoming visible in the event of a crash.

No it doesn't. Delayed allocation minimises the problem but doesn't
prevent it.  It has been known for years (since before I joined SGI
in 2002) that there is a theoretical timing gap in XFS where the
allocation transaction can commit and a crash occur before data hits
the disk hence exposing stale data.

The reality is that no-one has ever reported exposing stale data in
this scenario, and there has been plenty of effort expended trying
to trigger it. Hence it has remained in the realm of a theoretical
problem....

> If
> fsblock immediately allocates the physical block, then either the
> uninitialized data might become available on a system crash (which
> is a security problem), or XFS is going to have to force all newly
> written data blocks to disk before a commit.  If that sounds
> familiar it's what ext3's data=ordered mode does, and it's what is
> responsible for the Firefox 3.0 fsync performance problem.

If this were to occur, the obvious solution would be to
allocate unwritten extents and do the conversion after data I/O
completion. That would result in correct metadata/data ordering in
all cases with only a small performance impact and without
introducing ext3-sync-the-world-like issues...
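
As a toy model of that ordering (hypothetical names throughout, not XFS
code): the allocation commits with the extent flagged unwritten, reads
of an unwritten extent return zeros, and the flag is cleared only once
the data I/O has completed.

```python
from typing import Dict, Tuple

STALE = b"old secret"   # leftover contents of a previously freed block


class ToyDisk:
    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {7: STALE}
        # Committed metadata: logical block -> (physical block, unwritten flag)
        self.extents: Dict[int, Tuple[int, bool]] = {}

    def commit_allocation(self, logical: int, physical: int) -> None:
        # The allocation transaction commits with the extent marked unwritten.
        self.extents[logical] = (physical, True)

    def data_io_complete(self, logical: int, data: bytes) -> None:
        # Data reaches the disk first, then the extent is converted.
        physical, _ = self.extents[logical]
        self.blocks[physical] = data
        self.extents[logical] = (physical, False)

    def read(self, logical: int) -> bytes:
        physical, unwritten = self.extents[logical]
        if unwritten:
            # Unwritten extents read back as zeros, never as stale contents.
            return b"\0" * len(self.blocks[physical])
        return self.blocks[physical]
```

A crash between commit_allocation() and data_io_complete() leaves the
extent unwritten, so a later read sees zeros rather than the old
contents of block 7.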

Ted, I appreciate you telling the world over and over again how bad
XFS is and what you think needs to be done to fix it. Truth is, this
would have been a much better email had you written about it from an
ext4 perspective. That way it wouldn't have been full of errors or
sounded like a kid caught with his hand in the cookie jar:

"It's not my fault! I was only copying XFS! He did it first!"

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
