public inbox for linux-ext4@vger.kernel.org
From: Mike Waychison <mikew@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Theodore Tso <tytso@mit.edu>,
	Andreas Dilger <adilger@clusterfs.com>,
	Sreenivasa Busam <sreenivasac@google.com>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
Subject: Re: fallocate support for bitmap-based files
Date: Fri, 29 Jun 2007 18:07:25 -0400
Message-ID: <4685829D.2020401@google.com>
In-Reply-To: <20070629143818.9f4ac7d7.akpm@linux-foundation.org>

Andrew Morton wrote:
> On Fri, 29 Jun 2007 16:55:25 -0400
> Theodore Tso <tytso@mit.edu> wrote:
> 
> 
>>On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
>>
>>>Guys, Mike and Sreenivasa at google are looking into implementing
>>>fallocate() on ext2.  Of course, any such implementation could and should
>>>also be portable to ext3 and ext4 bitmapped files.
>>
>>What's the eventual goal of this work?  Would it be for mainline use,
>>or just something that would be used internally at Google?
> 
> 
> Mainline, preferably.
> 
> 
>> I'm not
>>particularly enthused about supporting two ways of doing fallocate();
>>one for ext4 and one for bitmap-based files in ext2/3/4.  Is the
>>benefit really worth it?
> 
> 
> umm, it's worth it if you don't want to wear the overhead of journalling,
> and/or if you don't want to wait on the, err, rather slow progress of ext4.
> 
> 
>>What I would suggest, which would make this much easier, is to make this
>>an incompatible extension (which, as you point out, is needed for
>>security reasons anyway) and then steal the high bit from the block
>>number field to indicate whether or not the block has been
>>initialized.  That way you don't end up having to seek to a potentially
>>distant part of the disk to check out the bitmap.  Also, you don't
>>have to worry about how to recover if the "block initialized bitmap"
>>inode gets smashed.  
>>
>>The downside is that it reduces the maximum size of the filesystem
>>supported by ext2 by a factor of two.  But, there are at least two
>>patch series floating about that promise to allow filesystem block
>>sizes larger than PAGE_SIZE which would allow you to recover the maximum
>>size supported by the filesystem.
>>
>>Furthermore, I suspect (especially after listening to a very fascinating
>>Usenix Invited Talk by Jeffrey Dean, a fellow from Google two weeks
>>ago) that for many of Google's workloads, using a filesystem blocksize
>>of 16K or 32K might not be a bad thing in any case.
>>
>>It would be a lot simpler....
>>
> 
> 
> Hadn't thought of that.
> 
> Also, it's unclear to me why google is going this way rather than using
> (perhaps suitably-tweaked) ext2 reservations code.
> 
> Because the stock ext2 block allocator sucks big-time.
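
A minimal sketch of the high-bit idea quoted above, for illustration only.
It assumes 32-bit on-disk block numbers, as in ext2's block maps; the macro
and function names are hypothetical and are not taken from ext2 or from any
posted patch.  The size helper only spells out the "factor of two"
arithmetic: with 4K blocks, 32 bits address 16 TiB, 31 bits address 8 TiB,
and doubling the block size to 8K brings the limit back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bit 31 flags "allocated but not yet initialized". */
#define BLK_UNINIT_FLAG UINT32_C(0x80000000)
/* The remaining 31 bits hold the physical block number. */
#define BLK_NUMBER_MASK UINT32_C(0x7fffffff)

/* Tag a block-map entry as preallocated but uninitialized. */
static inline uint32_t blk_mark_uninit(uint32_t blk)
{
    assert((blk & BLK_UNINIT_FLAG) == 0); /* must fit in 31 bits */
    return blk | BLK_UNINIT_FLAG;
}

/* Clear the flag once the block has actually been written. */
static inline uint32_t blk_mark_init(uint32_t entry)
{
    return entry & BLK_NUMBER_MASK;
}

/* Reads of a flagged block would have to return zeroes, not disk contents. */
static inline int blk_is_uninit(uint32_t entry)
{
    return (entry & BLK_UNINIT_FLAG) != 0;
}

/* Physical block number with the flag stripped. */
static inline uint32_t blk_number(uint32_t entry)
{
    return entry & BLK_NUMBER_MASK;
}

/* Largest addressable filesystem in bytes for a given block-number width. */
static inline uint64_t max_fs_bytes(unsigned bits, uint32_t block_size)
{
    assert(bits < 64);
    return (UINT64_C(1) << bits) * block_size;
}
```

Because the flag lives in the block-map entry itself, checking it needs no
extra seek to a separate "initialized" bitmap, which is the advantage the
quoted suggestion points out.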

The primary reason this is a problem is that our writers into these 
files aren't necessarily coming from the same hosts in the cluster, so 
their arrival times aren't sequential.  It ends up looking to the kernel 
like a random write workload, which in turn ends up causing odd 
fragmentation patterns that aren't very deterministic.  That data is 
often eventually streamed off the disk though, which is when the 
fragmentation hurts.

Currently, our clustered filesystem supports pre-allocation of the 
target chunks of files, but this is implemented by effectively writing 
zeroes to files, which in turn causes pagecache churn and a double 
write-out of the blocks.  Recently, we've changed the code to minimize 
this pagecache churn and double write-out by performing an ftruncate to 
extend files, but then we'll be back to square one in terms of 
fragmentation for the random writes.
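
A rough userspace sketch of the three approaches mentioned above, assuming
Linux with glibc.  The helper names are hypothetical, and posix_fallocate()
stands in here for the fallocate() interface under discussion; on
filesystems without native support, glibc may emulate it by writing to each
block.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600 /* for pwrite(), ftruncate(), posix_fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* (a) Write zeroes: blocks get allocated, but every byte passes through
 * the pagecache and is written out once now and again with real data. */
static int prealloc_by_zeroing(int fd, off_t len)
{
    static const char zeroes[64 * 1024]; /* zero-initialized */
    off_t done = 0;

    assert(len >= 0);
    while (done < len) {
        off_t left = len - done;
        size_t n = left < (off_t)sizeof(zeroes) ? (size_t)left : sizeof(zeroes);
        ssize_t w = pwrite(fd, zeroes, n, done);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += w;
    }
    return 0;
}

/* (b) ftruncate: sets the file size cheaply but leaves a hole, so blocks
 * are only allocated later, in whatever order the writes arrive. */
static int extend_sparse(int fd, off_t len)
{
    assert(len >= 0);
    return ftruncate(fd, len);
}

/* (c) Ask the filesystem to allocate the range up front.
 * Note: returns an error number directly rather than -1 with errno. */
static int prealloc_by_fallocate(int fd, off_t len)
{
    assert(len > 0);
    return posix_fallocate(fd, 0, len);
}

/* Bytes actually backed by blocks (st_blocks counts 512-byte units on Linux). */
static off_t allocated_bytes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (off_t)st.st_blocks * 512;
}
```

Comparing allocated_bytes() after (b) and after (c) on the same file is a
quick way to see the difference: a sparse extension typically reports few
or no blocks, while a preallocated range reports the full length where the
filesystem supports it natively.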

Relying on (a tweaked) reservations code is also somewhat limiting at 
this stage given that reservations are lost on close(fd).  Unless we 
change the lifetime of the reservations (maybe for the lifetime of the 
in-core inode?), crank up the reservation sizes and deal with the 
overcommit issues, I can't think of any better way at this time to deal 
with the problem.

Mike Waychison

Thread overview: 19+ messages
2007-06-29 20:01 fallocate support for bitmap-based files Andrew Morton
2007-06-29 20:36 ` Dave Kleikamp
2007-06-29 20:52   ` Mike Waychison
2007-06-29 21:24     ` Dave Kleikamp
2007-06-29 20:55 ` Theodore Tso
2007-06-29 21:38   ` Andrew Morton
2007-06-29 22:07     ` Mike Waychison [this message]
2007-07-04 23:11       ` Valerie Henson
2007-07-06 21:15         ` Mike Waychison
2007-06-29 21:46   ` Andreas Dilger
2007-06-29 22:26     ` Mike Waychison
2007-06-30  5:14       ` Andreas Dilger
2007-06-30 14:31         ` Mingming Cao
2007-06-30 14:13 ` Mingming Cao
2007-06-30 17:29   ` Andreas Dilger
2007-07-02 14:44     ` Mingming Cao
2007-07-02 17:44   ` Badari Pulavarty
2007-07-06 21:33     ` Mike Waychison
2007-07-07  2:05       ` Badari Pulavarty
