public inbox for linux-kernel@vger.kernel.org
* [RFC] extents,delayed allocation,mballoc for ext3
@ 2004-04-13 19:28 alex
  2004-04-14  4:01 ` Matt Mackall
  2004-04-14 12:10 ` Alex Tomas
  0 siblings, 2 replies; 7+ messages in thread
From: alex @ 2004-04-13 19:28 UTC
  To: ext2-devel; +Cc: linux-kernel, alex


these patches implement several features for ext3:
- extents
- multiblock allocator
- delayed allocation (a.k.a. allocation on flush)


extents
=======
it's just a way to store an inode's block map as well-known triples
[logical block; phys. block; length]. all the extents are stored
in a B+tree. the code is split into two parts:
1) generic extents support
   implements primitives like lookup, insert, remove, walk
2) VFS part
   implements ->getblock() and ->truncate() methods
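
a rough sketch of what such a triple and a "lookup" primitive could look
like. this is not code from the patches: the struct, the function name and
the use of a flat sorted array (standing in for one B+tree leaf) are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical in-memory extent: one [logical; physical; length] triple */
struct demo_extent {
	uint32_t logical;	/* first logical block covered */
	uint64_t physical;	/* first physical block on disk */
	uint32_t len;		/* number of contiguous blocks */
};

/*
 * Map a logical block to a physical block by binary search over an
 * array sorted by ->logical with non-overlapping entries.
 * Returns 0 when the block falls into a hole (no extent covers it).
 */
static uint64_t demo_extent_map(const struct demo_extent *ex, size_t nr,
				uint32_t lblock)
{
	size_t lo = 0, hi = nr;

	assert(ex != NULL || nr == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (lblock < ex[mid].logical)
			hi = mid;		/* target is left of this extent */
		else if (lblock - ex[mid].logical >= ex[mid].len)
			lo = mid + 1;		/* target is right of this extent */
		else
			return ex[mid].physical + (lblock - ex[mid].logical);
	}
	return 0;
}
```

with extents {0, 1000, 8} and {8, 5000, 4}, logical block 9 maps to
physical block 5001, and logical block 12 is a hole.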

multiblock allocator
====================
the larger the extents, the better. the reasonable way to get them is
to ask the block allocator for several blocks at once. it is possible
to scan bitmaps, but such scanning isn't a very good method. so, here
is mballoc: a buddy algorithm plus a fast way to find contiguous
buddies. mballoc is backward-compatible: buddies are stored on disk
as a regular file (a temporary solution until fsck support is ready)
and regenerated at mount time. also, with the existing block-at-a-time
allocator it's impossible to write at a very high rate (several
hundred MB/s). the multiblock allocator solves this issue.
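
to illustrate the "regenerated at mount time" part, here is a minimal
sketch of rebuilding buddy information from a plain block bitmap. it is
not the mballoc code: the names, the tiny maximum order and the choice to
keep only per-order counters are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ORDER 4	/* chunks of up to 2^4 = 16 blocks, for illustration */

/* non-zero if 'len' blocks starting at 'start' are all free (bit clear) */
static int demo_range_free(const uint8_t *bitmap, unsigned start, unsigned len)
{
	for (unsigned i = start; i < start + len; i++)
		if (bitmap[i / 8] & (1u << (i % 8)))
			return 0;
	return 1;
}

/*
 * Rebuild per-order counters of free buddies from a block bitmap.
 * Every free block is accounted exactly once, as part of the largest
 * aligned power-of-two chunk that starts at it and is entirely free.
 */
static void demo_build_buddy(const uint8_t *bitmap, unsigned nblocks,
			     unsigned counts[DEMO_MAX_ORDER + 1])
{
	unsigned pos = 0;

	assert(bitmap != NULL && counts != NULL);
	memset(counts, 0, sizeof(unsigned) * (DEMO_MAX_ORDER + 1));
	while (pos < nblocks) {
		unsigned order;

		if (!demo_range_free(bitmap, pos, 1)) {
			pos++;			/* block in use, skip it */
			continue;
		}
		/* try the largest order first; order 0 always fits here */
		for (order = DEMO_MAX_ORDER; order > 0; order--) {
			unsigned len = 1u << order;

			if (pos % len == 0 && pos + len <= nblocks &&
			    demo_range_free(bitmap, pos, len))
				break;
		}
		counts[order]++;
		pos += 1u << order;
	}
}
```

with counters like these, a request for 2^k contiguous blocks only needs
to check counts[k] and above instead of scanning the bitmap bit by bit.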

delayed allocation
==================
this is a ->writepages() implementation that exploits the very nice
tagged radix tree. it finds contiguous ranges of pages and asks the
extents code to walk over the specified ranges of blocks. the extents
code calls a given callback routine that allocates blocks for the
listed pages, builds the bios and submits them to disk.
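
the range-finding step can be sketched roughly as follows. this is an
illustration under assumptions, not the patch: a sorted array of page
indexes stands in for the result of a tagged radix tree lookup, and the
callback only does accounting where the real one would allocate blocks
and build bios. all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* called once per contiguous run of pages: [start, start + len) */
typedef void (*demo_range_cb)(unsigned long start, unsigned long len,
			      void *arg);

/*
 * Split a sorted array of page indexes into contiguous runs and hand
 * each run to the callback. Returns the number of runs found.
 */
static size_t demo_walk_ranges(const unsigned long *idx, size_t nr,
			       demo_range_cb cb, void *arg)
{
	size_t ranges = 0, i = 0;

	assert(cb != NULL);
	while (i < nr) {
		size_t j = i + 1;

		/* extend the run while the next index is adjacent */
		while (j < nr && idx[j] == idx[j - 1] + 1)
			j++;
		cb(idx[i], (unsigned long)(j - i), arg);
		ranges++;
		i = j;
	}
	return ranges;
}

/* example callback: just records how many and how large the requests were */
struct demo_alloc_stats {
	unsigned long calls;	/* number of allocation requests */
	unsigned long blocks;	/* total pages/blocks requested */
	unsigned long largest;	/* largest single request */
};

static void demo_account_range(unsigned long start, unsigned long len,
			       void *arg)
{
	struct demo_alloc_stats *st = arg;

	(void)start;
	st->calls++;
	st->blocks += len;
	if (len > st->largest)
		st->largest = len;
}
```

for pages 3,4,5,9,10,20 this makes three requests (3, 2 and 1 pages)
where a block-at-a-time allocator would be called six times.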


todo
====
1) blocks must be reserved to avoid -ENOSPC upon writeback
2) blocks must become available for allocation only after commit
3) data=ordered support
4) blocksize < PAGE_CACHE_SIZE support
5) option for the allocator to look for +N blocks if the goal is busy
6) probably preallocation for slowly-growing files
7) allocation policy tuning
8) regenerate buddies only after a crash


NOTE: don't try to use it in production. all the patches (probably
excluding extents) are pre-pre-alpha. because of their size, I put the
patches at ftp://ftp.clusterfs.com/pub/people/alex/2.6.4-mm2/


benchmarks (hardware: 2way iPIII-1000MHz/512MB/old scsi hdd (20MB/s))
=====================================================================

I ran dd to write a specified amount of data and measured the time ext3
spent in the allocator via get_cycles(). all the bitmaps were preloaded.

size 5kb, before:  2 allocations, 30285 cycles
size 5kb, before:  2 allocations, 27550 cycles
size 5kb, before:  2 allocations, 27307 cycles
size 5kb, before:  2 allocations, 27486 cycles
14078 cycles per block

size 5kb, after :  1 allocations, 50531 cycles
size 5kb, after :  1 allocations, 47915 cycles
size 5kb, after :  1 allocations, 51890 cycles
size 5kb, after :  1 allocations, 49094 cycles
24928 cycles per block

size 600kb, before:  151 allocations, 443282 cycles
size 600kb, before:  151 allocations, 439809 cycles
size 600kb, before:  151 allocations, 438705 cycles
size 600kb, before:  151 allocations, 506309 cycles
3026 cycles per block

size 600kb, after :  1 allocations, 55344 cycles
size 600kb, after :  1 allocations, 55094 cycles
size 600kb, after :  1 allocations, 54311 cycles
size 600kb, after :  1 allocations, 102892 cycles
446 cycles per block

size 12445kb, before:  3117 allocations, 9683780 cycles
size 12445kb, before:  3117 allocations, 9866494 cycles
size 12445kb, before:  3117 allocations, 9702287 cycles
size 12445kb, before:  3117 allocations, 10127695 cycles
3158 cycles per block

size 12445kb, after :  1 allocations, 60446 cycles
size 12445kb, after :  1 allocations, 60978 cycles
size 12445kb, after :  1 allocations, 65121 cycles
size 12445kb, after :  1 allocations, 61893 cycles
20 cycles per block
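
the "cycles per block" lines look like the mean of the four runs divided
by the number of blocks written. that is an assumption, the exact method
isn't stated, and the block count is taken to equal the number of
"before" allocations. a small sketch of that arithmetic (function name
is made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Mean allocator cost per block: average the cycle counts of all runs,
 * then divide by the number of blocks each run wrote. Integer division
 * truncates, so results are rounded down.
 */
static uint64_t demo_cycles_per_block(const uint64_t *cycles, size_t runs,
				      uint64_t blocks)
{
	uint64_t sum = 0;

	assert(runs > 0 && blocks > 0);
	for (size_t i = 0; i < runs; i++)
		sum += cycles[i];
	return (sum / runs) / blocks;
}
```

fed with the four "size 5kb, before" samples and 2 blocks this gives
14078, and with the four "size 12445kb, before" samples and 3117 blocks
it gives 3158, matching the lines above.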


single dd writes 2GB:
before:
	real    2m2.623s
	user    0m0.028s
	sys     0m12.236s

after:
	real    1m49.696s
	user    0m0.028s
	sys     0m8.008s


9 copies of dd, each writes 20MB:
before:
	real    1m22.151s
	user    0m0.057s
	sys     0m2.102s

after:
	real    0m9.664s
	user    0m0.061s
	sys     0m1.209s


time to dbench things ...
before:
	Throughput 150.215 MB/sec 8 procs
	Throughput 140.273 MB/sec 8 procs
	Throughput 153.377 MB/sec 8 procs
	Throughput 101.198 MB/sec 8 procs
	Average: 136.26575

	Throughput 68.8406 MB/sec 16 procs
	Throughput 83.0574 MB/sec 16 procs
	Throughput 48.3245 MB/sec 16 procs
	Throughput 54.4254 MB/sec 16 procs
	Average: 63.66197

	Throughput 53.6807 MB/sec 32 procs
	Throughput 66.5997 MB/sec 32 procs
	Throughput 59.4454 MB/sec 32 procs
	Throughput 62.9191 MB/sec 32 procs
	Average: 60.66122

after:
	Throughput 226.799 MB/sec 8 procs
	Throughput 205.548 MB/sec 8 procs
	Throughput 220.675 MB/sec 8 procs
	Throughput 192.285 MB/sec 8 procs
	Average: 211.32675

	Throughput 178.969 MB/sec 16 procs
	Throughput 182.105 MB/sec 16 procs
	Throughput 196.786 MB/sec 16 procs
	Throughput 180.526 MB/sec 16 procs
	Average: 184.59650

	Throughput 139.905 MB/sec 32 procs
	Throughput 132.72 MB/sec 32 procs
	Throughput 133.429 MB/sec 32 procs
	Throughput 131.498 MB/sec 32 procs
	Average: 134.38800



* Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-13 19:28 [RFC] extents,delayed allocation,mballoc for ext3 alex
@ 2004-04-14  4:01 ` Matt Mackall
  2004-04-14 12:05   ` Alex Tomas
  2004-04-14 12:10 ` Alex Tomas
  1 sibling, 1 reply; 7+ messages in thread
From: Matt Mackall @ 2004-04-14  4:01 UTC
  To: alex; +Cc: ext2-devel, linux-kernel

On Tue, Apr 13, 2004 at 11:28:57PM +0400, alex@clusterfs.com wrote:
> 
> these patches implement several features for ext3:
> - extents
> - multiblock allocator
> - delayed allocation (a.k.a. allocation on flush)
> 
> 
> extents
> =======
> it's just a way to store inode's blockmap in well-known triples
> [logical block; phys. block; length]. all the extents are stored
> in B+Tree. code is splitted in two parts:
> 1) generic extents support
>    implements primitives like lookup, insert, remove, walk
> 2) VFS part
>    implements ->getblock() and ->truncate() methods

I'm going to assume that there's no way for ext3 without extents
support to mount such a filesystem, so I think this means changing the
FS name. Is there a simple migration path to extents for existing filesystems?
 
> multiblock allocator
> ===================
> the larger extents the better. the reasonable way is to ask block
> allocator to allocate several blocks at once. it is possible to
> scan bitmaps, but such a scanning isn't very good method. so, here
> is mballoc - buddy algorithm + possibility to find contig.buddies
> fast way. mballoc is backward-compatible, buddies are stored on a
> disk as usual file (temporal solution until fsck support is ready)
> and regenerated at mount time. also, with existing block-at-once
> allocator it's impossible to write at very high rate (several
> hundreds MB a sec). multiblock allocator solves this issue.

Similar questions here.
 
> NOTE: don't try to use it in production. all the patches (probably
> excluding extents) are pre-pre-alpha. because of size I put patches
> in ftp://ftp.clusterfs.com/pub/people/alex/2.6.4-mm2/

You might also mention that on-disk format issues such as endian
layout are not finalized.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting


* Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-14  4:01 ` Matt Mackall
@ 2004-04-14 12:05   ` Alex Tomas
  2004-04-19 19:47     ` [Ext2-devel] " Stephen C. Tweedie
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Tomas @ 2004-04-14 12:05 UTC
  To: Matt Mackall; +Cc: alex, ext2-devel, linux-kernel

>>>>> Matt Mackall (MM) writes:

 MM> I'm going to assume that there's no way for ext3 without extents
 MM> support to mount such a filesystem, so I think this means changing the
 MM> FS name. Is there a simple migration path to extents for existing filesystems?

yeah. you're right. I see no way to make it backward-compatible. in fact,
I haven't thought much about the name. probably you're right again and this
"ext3 on steroids" should have another name.
 
 MM> Similar questions here.

no. this one is backward-compatible and the usual ext3 will run ok.
btw, I think it's possible to implement a few routines that would allow
using the delayed allocation and multiblock allocator patches w/o
introducing extents. the most visible effect of the extents is much
faster truncate.

 MM> You might also mention that on-disk format issues such as endian
 MM> layout are not finalized.

yep. thanks for pointing it out.



* Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-13 19:28 [RFC] extents,delayed allocation,mballoc for ext3 alex
  2004-04-14  4:01 ` Matt Mackall
@ 2004-04-14 12:10 ` Alex Tomas
  2004-04-14 17:49   ` [Ext2-devel] " Mingming Cao
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Tomas @ 2004-04-14 12:10 UTC
  To: ext2-devel; +Cc: linux-kernel, alex


I've just benchmarked ext3 vs. ext3+reservation vs. ext3+delalloc vs. xfs
with tiobench.

Sequential Writes
                              File  Blk   Num                   Avg     CPU
Identifier                    Size  Size  Thr   Rate  (CPU%)  Latency   Eff
---------------------------- ------ ----- ---  ------ ------ --------- -----
ext3                          1024  4096    4   13.34 14.76%     0.897    90
ext3-dalloc                   1024  4096    4   26.39 19.26%     0.452   137
ext3-reserv                   1024  4096    4   23.77 28.99%     0.529    82
xfs                           1024  4096    4   27.22 20.68%     0.373   132

ext3                          1024  4096    8    9.71 10.82%     2.421    90
ext3-dalloc                   1024  4096    8   25.81 18.64%     0.816   138
ext3-reserv                   1024  4096    8   23.62 29.49%     1.006    80
xfs                           1024  4096    8   27.06 22.49%     0.763   120

ext3                          1024  4096   16    6.60 7.891%     7.222    84
ext3-dalloc                   1024  4096   16   24.99 19.71%     1.783   127
ext3-reserv                   1024  4096   16   23.04 28.15%     1.849    82
xfs                           1024  4096   16   24.84 20.58%     1.300   121

ext3                          1024  4096   32    8.12 9.872%     8.111    82
ext3-dalloc                   1024  4096   32   24.83 20.01%     2.995   124
ext3-reserv                   1024  4096   32   22.72 29.51%     3.282    77
xfs                           1024  4096   32   25.47 21.75%     2.247   117

ext3-dalloc is ext3 + extents + delayed allocation + multiblock allocator
ext3-reserv is ext3 + reservation patches by Mingming Cao



* Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-14 12:10 ` Alex Tomas
@ 2004-04-14 17:49   ` Mingming Cao
  0 siblings, 0 replies; 7+ messages in thread
From: Mingming Cao @ 2004-04-14 17:49 UTC
  To: Alex Tomas; +Cc: ext2-devel, linux-kernel

On Wed, 2004-04-14 at 05:10, Alex Tomas wrote:
> 
> I've just benched ext3 vs. ext3+reservation vs. ext3+delalloc vs. xfs.
> it was tiobench.
> ext3                          1024  4096   32    8.12 9.872%     8.111    82
> ext3-dalloc                   1024  4096   32   24.83 20.01%     2.995   124
> ext3-reserv                   1024  4096   32   22.72 29.51%     3.282    77
> xfs                           1024  4096   32   25.47 21.75%     2.247   117
> 
Hi Alex,

Nice comparison! The ext3 reservation system uses more CPU because we do
reservations in memory (not on disk) and we have a global per-filesystem
lock to guard the operation.  The current algorithm for finding a new
reservation window is not perfect yet.

Extents and delayed allocation are probably the right way to go for the
next generation (maybe ext4).  Currently I am just trying to fix the
missing preallocation feature in ext3 without breaking disk compatibility
or involving too many changes....

Thanks for your interest.

Mingming



* Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-14 12:05   ` Alex Tomas
@ 2004-04-19 19:47     ` Stephen C. Tweedie
  2004-04-20 22:05       ` Matt Mackall
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen C. Tweedie @ 2004-04-19 19:47 UTC
  To: Alex Tomas
  Cc: Matt Mackall, ext2-devel@lists.sourceforge.net, linux-kernel,
	Stephen Tweedie

Hi,

On Wed, 2004-04-14 at 13:05, Alex Tomas wrote:

>  MM> I'm going to assume that there's no way for ext3 without extents
>  MM> support to mount such a filesystem, so I think this means changing the
>  MM> FS name. Is there a simple migration path to extents for existing filesystems?
> 
> yeah. you're right. I see no way to make it backward-compatible. in fact,
> I haven't think much about name. probably you're right again and this
> "ext3 on steroids" should have another name.

We've already got feature compatibility bits that can deal with this
sort of thing.  There are various other proposed incompatible features,
such as large inodes and dynamically placed metadata (eg. placing inode
tables into an inode "file"), too.  Rather than invent new names for
each combination of incompatible feature set, we're probably better off
just using the feature masks.

Cheers,
 Stephen




* Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3
  2004-04-19 19:47     ` [Ext2-devel] " Stephen C. Tweedie
@ 2004-04-20 22:05       ` Matt Mackall
  0 siblings, 0 replies; 7+ messages in thread
From: Matt Mackall @ 2004-04-20 22:05 UTC
  To: Stephen C. Tweedie
  Cc: Alex Tomas, ext2-devel@lists.sourceforge.net, linux-kernel

On Mon, Apr 19, 2004 at 08:47:10PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Wed, 2004-04-14 at 13:05, Alex Tomas wrote:
> 
> >  MM> I'm going to assume that there's no way for ext3 without extents
> >  MM> support to mount such a filesystem, so I think this means changing the
> >  MM> FS name. Is there a simple migration path to extents for existing filesystems?
> > 
> > yeah. you're right. I see no way to make it backward-compatible. in fact,
> > I haven't think much about name. probably you're right again and this
> > "ext3 on steroids" should have another name.
> 
> We've already got feature compatibility bits that can deal with this
> sort of thing.  There are various other proposed incompatible features,
> such as large inodes and dynamically placed metadata (eg. placing inode
> tables into an inode "file"), too.  Rather than invent new names for
> each combination of incompatible feature set, we're probably better off
> just using the feature masks.

I'm aware of the existence of such features, I just think it's yet to
be demonstrated that they're actually a good idea for real deployment
by themselves. ext3+{btree,extents} is not backwards compatible in any
useful sense, unlike features such as journalling, directory hashing,
sparse superblocks, wandering journals, etc. Given that you can't
mount the new filesystem with an old kernel, not changing the name can
only result in confusion.

But I see your point about dealing with a cartesian product of
features. So if and when this stuff approaches beta, we should
probably use the feature flags _and_ change the name to something like
ext3+be (btrees, extents) or ext3+i (inode in file) to indicate the
presence of experimental, incompatible features, and when the feature
set is actually pinned down, rename it simply ext3+ or ext4 or whatever.

It might be possible to have ext4 actually be a family of filesystems
where extents or large inodes are optional, but I suspect the value of
that would be minimal and again, all such features would have to be
available in every kernel tree that claimed to support ext4.

-- 
Matt Mackall : http://www.selenic.com : Linux development and consulting

