linux-ext4.vger.kernel.org archive mirror
From: "Lukáš Czerner" <lczerner@redhat.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: Lukas Czerner <lczerner@redhat.com>,
	linux-ext4@vger.kernel.org, gharm@google.com
Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
Date: Mon, 25 Mar 2013 11:09:35 +0100 (CET)
Message-ID: <alpine.LFD.2.00.1303251051460.23176@localhost>
In-Reply-To: <20130324001143.GB4000@thunk.org>

On Sat, 23 Mar 2013, Theodore Ts'o wrote:

> Date: Sat, 23 Mar 2013 20:11:43 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, gharm@google.com
> Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
> 
> On Thu, Mar 21, 2013 at 04:50:45PM +0100, Lukas Czerner wrote:
> > 
> > Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> > large fallocate requests were not physically contiguous. However it is
> > important to see why that is the case. Because the request is so big the
> > allocator will try to find a free group to allocate from, skipping block
> > groups which are used, which is fine. However it will only allocate
> > extents of 2^15-1 blocks (limitation of uninitialized extent size)
> > which will leave one block in each block group free which will make the
> > extent tree physically non-contiguous, however _only_ by one block which
> > is perfectly fine.
> 
> Well, it's actually really unfortunate.  The file ends up being more
> fragmented, and from an alignment point of view it's really horrid.
> For a RAID array with a power of 2 stripe size, or a flash device with
> a power of 2 erase block size, the result is actually quite
> spectacularly bad:

Sorry for being dense, but I am trying to understand why this is so
bad and what the "expected" column there means.

The physical offset of each extent below starts at the start of a
block group, so it seems to me that it's perfectly aligned for every
power of two up to the block group size.

If the extent started at the physical offset from the "expected"
column, then it would be misaligned.
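
To make the alignment argument above concrete, here is a minimal
sketch that checks it against the numbers in the filefrag listing
quoted below. The block group size of 32768 blocks is an assumption
(4096-byte blocks with the default group size), and the helper name
max_pow2_alignment is made up for illustration. It only looks at
physical offsets, not logical ones.

```python
# Physical start and "expected" start of extents 1..8, copied from the
# filefrag listing quoted below in this mail.
BLOCKS_PER_GROUP = 32768  # assumption: 4096-byte blocks, default group size

extents = [
    # (physical_start, expected)
    (491520, 491519),
    (589824, 524287),
    (622592, 622591),
    (655360, 655359),
    (688128, 688127),
    (720896, 720895),
    (753664, 753663),
    (786432, 786431),
]


def max_pow2_alignment(block, limit=BLOCKS_PER_GROUP):
    """Largest power of two, capped at limit, that divides block."""
    align = 1
    while align < limit and block % (align * 2) == 0:
        align *= 2
    return align


if __name__ == "__main__":
    for physical, expected in extents:
        print(
            f"physical {physical}: aligned to {max_pow2_alignment(physical)} blocks, "
            f"expected {expected}: aligned to {max_pow2_alignment(expected)} blocks"
        )
```

Under that assumption every physical start in the listing is a
multiple of the block group size, while every value in the "expected"
column is odd and so is aligned to nothing larger than a single block.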

Maybe I am missing something, or maybe I misunderstood the concept?
The only problem I see is when we would like to use that one
remaining block, but that's expected, and the only way to avoid it
is to allocate smaller extents instead, as you suggested below
(16384 blocks).
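
The arithmetic behind the leftover block can be sketched as follows.
This is only an illustration, assuming a block group of 32768 blocks
and extents that are packed from the start of an empty group without
crossing a group boundary; the function name layout is hypothetical.

```python
FILE_BLOCKS = 262144      # 1 GiB file in 4096-byte blocks, from the listing
BLOCKS_PER_GROUP = 32768  # assumption: 4096-byte blocks, default group size


def layout(extent_len):
    """Return (extent_count, tail_len, free_blocks_left_per_group).

    Assumes extents are packed from the start of an empty block group
    and never cross a group boundary.
    """
    full, tail = divmod(FILE_BLOCKS, extent_len)
    extent_count = full + (1 if tail else 0)
    per_group = BLOCKS_PER_GROUP // extent_len
    leftover = BLOCKS_PER_GROUP - per_group * extent_len
    return extent_count, tail, leftover


if __name__ == "__main__":
    for extent_len in (32767, 16384, 2048):
        count, tail, leftover = layout(extent_len)
        print(
            f"extent length {extent_len}: {count} extents, "
            f"tail of {tail} blocks, {leftover} block(s) left free per group"
        )
```

With 32767-block extents this gives 9 extents, the last one 8 blocks
long, and one free block per group, which matches the first listing.
With 16384-block extents two of them fill a group exactly, so nothing
is left over, at the cost of roughly twice as many extents.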

Thanks!
-Lukas

> 
> File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32766:     458752..    491518:  32767:             unwritten
>    1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
>    2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten
>    3:    98301..  131067:     622592..    655358:  32767:     622591: unwritten
>    4:   131068..  163834:     655360..    688126:  32767:     655359: unwritten
>    5:   163835..  196601:     688128..    720894:  32767:     688127: unwritten
>    6:   196602..  229368:     720896..    753662:  32767:     720895: unwritten
>    7:   229369..  262135:     753664..    786430:  32767:     753663: unwritten
>    8:   262136..  262143:     786432..    786439:      8:     786431: unwritten,eof
> 1: 9 extents found
> 
> That being said, what we were doing before was quite bad, and you're
> quite right about your analysis here:
> 
> > This will never happen when we normalize the request because for some
> > reason (maybe bug) it will be normalized to much smaller request (2048
> > blocks) and those extents will then be merged together not leaving any
> > free block in between - hence physically contiguous. However the fact
> > that we're splitting huge requests into ton of smaller ones and then
> > merging extents together is very _very_ bad for fallocate performance.
> > 
> > The situation is even worse since with commit
> > ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
> > uninitialized extents so we end up with absolutely _huge_ extent tree
> > for bigger fallocate requests, which is bad for performance not
> > only during fallocate itself, but also when working with the file
> > later on.
> 
> Without this patch, we currently do this for the same 1g file:
> 
> Filesystem type is: ef53
> File size of 2 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    2047:     305152..    307199:   2048:             unwritten
>    1:     2048..    4095:     307200..    309247:   2048:             unwritten
>    	  	       .....
>  106:   217088..  219135:     522240..    524287:   2048:             unwritten
>  107:   219136..  221183:     591872..    593919:   2048:     524288: unwritten
>  108:   221184..  223231:     593920..    595967:   2048:             unwritten
>  		       .....
>  127:   260096..  262143:     632832..    634879:   2048:             unwritten,eof
> 2: 2 extents found
> 
> So I agree that what we're doing is poor, but the question is, can we
> do something which is better than either of these two results?
> 
> That is, can we improve mballoc so that we keep an fallocated gigabyte
> file as physically contiguous as possible, while using an optimal
> number of on-disk extents?   i.e., 9 extents of length 32767.
> 
> Failing that, can we create 20 extents of length 16384 or so?
> 
> 	      	     	       	       	  	 - Ted
> 
