[RFC 0/13] extents and 48bit ext3

* [RFC 0/13] extents and 48bit ext3
@ 2006-06-09  1:20 Mingming Cao
  2006-06-09  2:40 ` Valdis.Kletnieks
                   ` (19 more replies)
  0 siblings, 20 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-09  1:20 UTC (permalink / raw)
  To: linux-kernel, ext2-devel, linux-fsdevel

The current ext3 filesystem is limited to 8TB (4k block size). This is
not enough in practice for the growing need for bigger storage as disks
grow over the next few years (or even now).

To address this need, there is a joint effort from RedHat, ClusterFS, IBM
and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
ext3 is built on top of the extent map changes for ext3, originally from
Alex Tomas. In short, the new ext3 on-disk extents format is:

On disk extents format:
/*
  * this is extent on-disk structure
  * it's used at the bottom of the tree
  */
struct ext3_extent {
        __le32  ee_block;       /* first logical block extent covers */
        __le16  ee_len;         /* number of blocks covered by extent */
        __le16  ee_start_hi;    /* high 16 bits of physical block */
        __le32  ee_start;       /* low 32 bits of physical block */
};
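
As a rough illustration of how the split start fields combine into one
48 bit physical block number, here is a minimal userspace sketch. It
assumes host-endian fields (the real structure is little-endian on disk,
and byte-order conversion is left out), and the names extent_sketch,
extent_start_block and extent_store_start are hypothetical, not taken
from the patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Host-endian mirror of the on-disk extent above. The real structure
 * uses little-endian __le16/__le32 fields; conversion is omitted here.
 */
struct extent_sketch {
	uint32_t ee_block;	/* first logical block extent covers */
	uint16_t ee_len;	/* number of blocks covered by extent */
	uint16_t ee_start_hi;	/* high 16 bits of physical block */
	uint32_t ee_start;	/* low 32 bits of physical block */
};

/* Hypothetical helper: assemble the 48 bit physical start block. */
static uint64_t extent_start_block(const struct extent_sketch *ex)
{
	return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/* Hypothetical helper: split a 48 bit physical block into the two fields. */
static void extent_store_start(struct extent_sketch *ex, uint64_t pblk)
{
	assert(pblk < (1ULL << 48));	/* must fit in 16 + 32 bits */
	ex->ee_start_hi = (uint16_t)(pblk >> 32);
	ex->ee_start = (uint32_t)pblk;
}
```

Storing 0x123456789ABC with extent_store_start() would leave 0x1234 in
ee_start_hi and 0x56789ABC in ee_start, and extent_start_block() would
return the original value.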

A series of patches was posted to the ext2-devel list in the last month
and has been reviewed.  This is the updated full series of patches to
support 48 bit ext3 based on the extent map. The patches are against the
2.6.17-rc6 kernel and can be found at
http://ext2.sourceforge.net/48bitext3/patches/patches-2.6.17-rc6-06082006/

[patch 1/13] percpu_counter_longlong.patch
percpu counter data type changes to support a 64 bit ext3 free blocks count

[patch 2/13] ext3_check_sector_t_overflow.patch
sector_t overflow check for 32bit/48bit ext3 at mount/resize time

[patch 3/13] ext3_fsblk_t_fixes.patch
Define ext3 filesystem and group block types (ext3_fsblk_t,
ext3_grpblk_t), and fix the in-kernel ext3 block types (from int to
ext3_fsblk_t) to support 32bit ext3.

[patch 4/13] ext3_convert_blks_to_fsblk_t.patch
convert the rest of ext3 filesystem blocks to ext3_fsblk_t

Patches 1-4 are currently in the mm tree.

[patch 5/13] sector_fmt.patch
sector_t type format string for all architectures.

[patch 6/13] ext3_fsblk_sector_t.patch
support >32 bit fs block type in kernel (convert ext3_fsblk_t to
sector_t)

[patch 7/13] 64bit_jbd_core.patch
Core 64 bit JBD changes

[patch 8/13] sector_t-jbd.patch
JBD layer in-kernel block variables type fixes to support >32
bit block number and convert to sector_t type.

#extent map patches
[patch 9/13] ext3-extents.patch
core extent map support

[patch 10/13] ext3-extents-48bit.patch
Add full 48 bit physical block support based on extents.

[patch 11/13] ext3-extents-ext3_fsblk_t.patch
convert block types in extents to ext3_fsblk_t

[patch 12/13] ext3_48bit_i_file_acl.pat
48 bit on-disk i_file_acl to support xattr for 48 bit ext3

[patch 13/13] 64bit-metadata
On-disk and in-kernel super block changes to support >32
bit free blocks numbers.
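
For readers unfamiliar with the sector_t issue behind patch 2, the
following is only a sketch of the kind of check involved, not the code
from ext3_check_sector_t_overflow.patch. It assumes 512-byte sectors,
and the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: can a filesystem of `blocks` blocks, each
 * (1 << blocksize_bits) bytes, be addressed in 512-byte sectors by a
 * sector type that is `sector_bits` wide (32 or 64)?
 */
static bool fs_fits_sector_type(uint64_t blocks, unsigned blocksize_bits,
				unsigned sector_bits)
{
	assert(blocksize_bits >= 9 && blocksize_bits <= 16);
	assert(sector_bits == 32 || sector_bits == 64);

	unsigned shift = blocksize_bits - 9;	/* log2(sectors per block) */

	/* Would the conversion to sectors lose high bits even in 64 bits? */
	if (shift && (blocks >> (64 - shift)) != 0)
		return false;

	uint64_t sectors = blocks << shift;

	if (sector_bits == 64)
		return true;
	return sectors <= UINT32_MAX;
}
```

Under these assumptions, with 4k blocks a 32 bit sector type runs out at
2^29 blocks (2TB), so a mount or resize beyond that would have to be
refused on such a kernel, while a 64 bit sector type accepts a full
48 bit block count.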

Any comments and feedback are appreciated!

Mingming

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
@ 2006-06-09  2:40 ` Valdis.Kletnieks
  2006-06-09  8:20   ` Andreas Dilger
  2006-06-09 15:23   ` Mingming Cao
  2006-06-09  2:49 ` Jeff Garzik
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 296+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09  2:40 UTC (permalink / raw)
  To: cmm; +Cc: linux-kernel, ext2-devel, linux-fsdevel

On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
> The current ext3 filesystem is limited to 8TB (4k block size). This is
> not enough in practice for the growing need for bigger storage as disks
> grow over the next few years (or even now).
> 
> To address this need, there is a joint effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
> expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is built on top of the extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:

which implies matching changes to mkfs.ext2 and possibly mount..

> Any comments and feedback are appreciated!

Somebody else was recently discussing a set of ext3 patches for
extents+delalloc+mballoc - is this work compatible with that?

Also, a pointer to the matching userspace patches would help anybody
who's gung-ho enough to test the code....


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
  2006-06-09  2:40 ` Valdis.Kletnieks
@ 2006-06-09  2:49 ` Jeff Garzik
  2006-06-09  8:35   ` Andreas Dilger
  2006-06-09 17:14   ` Alan Cox
  2006-06-09  9:13 ` Christoph Hellwig
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09  2:49 UTC (permalink / raw)
  To: cmm, Andrew Morton, Linus Torvalds
  Cc: linux-kernel, ext2-devel, linux-fsdevel

Mingming Cao wrote:
> The current ext3 filesystem is limited to 8TB (4k block size). This is
> not enough in practice for the growing need for bigger storage as disks
> grow over the next few years (or even now).
> 
> To address this need, there is a joint effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
> expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is built on top of the extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:

One of my common complaints about massive ext3 updates such as this is 
the ever-growing "which ext3 filesystem am I mounting?" problem.

I really think extents and 48bit-ness should imply
	cp -a fs/ext3 fs/ext4
and go from there.

IMHO the ext3 back-compat situation is already really hairy, with all 
the features added since the original ext3 release.

The alternative is continual bloating of ext3, and on filesystems, 
inodes which are progressively upgraded -- meaning any use of a prior 
kernel implies that you can only read a subset of your [meta]data, if 
the back-compat code doesn't block the mount entirely.

People (including me) still switch back and forth between ext2 and ext3 
mounts of the same filesystem on occasion.  I think creating an "ext4" 
would allow for greater developer flexibility in implementing new 
features and ditching old ones -- while also emphasizing to the user 
that switching back and forth between ext4 and ext[23] is a bad idea.

Overall, after applying extent (and 48bit) patches, I think it is wrong 
to keep calling it ext3.  That will break some existing user 
assumptions, and continue to restrict developers' freedom to implement 
nifty new features.

	Jeff

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  2:40 ` Valdis.Kletnieks
@ 2006-06-09  8:20   ` Andreas Dilger
  2006-06-09 18:35     ` [Ext2-devel] " Stephen C. Tweedie
  2006-06-09 15:23   ` Mingming Cao
  1 sibling, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09  8:20 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: cmm, linux-kernel, ext2-devel, linux-fsdevel

On Jun 08, 2006  22:40 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
> > To address this need, there is a joint effort from RedHat, ClusterFS, IBM
> > and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
> > expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> > ext3 is built on top of the extent map changes for ext3, originally from
> > Alex Tomas. In short, the new ext3 on-disk extents format is:
> 
> which implies matching changes to mkfs.ext2 and possibly mount..

The extents format doesn't need any support from mke2fs.  Currently this
is activated by a mount option "-o extents", so it won't be used until
a system administrator actively enables it.

> > Any comments and feedback are appreciated!
> 
> Somebody else was recently discussing a set of ext3 patches for
> extents+delalloc+mballoc - is this work compatible with that?

Yes, completely compatible (author is the same person).  We have all been
working to get these improvements into the vanilla kernel so that everyone
can benefit from the improved performance.  These patches are just the
start - the mballoc and delalloc patches are follow-on patches, but they
do not affect the on-disk format, just the in-memory implementation of
block allocation.

> Also, a pointer to the matching userspace patches would help anybody
> who's gung-ho enough to test the code....

They were posted to the ext2-devel mailing list previously, or you can
download a patched RPM at ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/
(the extent support is making its way into the official e2fsprogs also).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  2:49 ` Jeff Garzik
@ 2006-06-09  8:35   ` Andreas Dilger
  2006-06-09 15:08     ` Jeff Garzik
  2006-06-09 17:14   ` Alan Cox
  1 sibling, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09  8:35 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel

On Jun 08, 2006  22:49 -0400, Jeff Garzik wrote:
> One of my common complaints about massive ext3 updates such as this is 
> the ever-growing "which ext3 filesystem am I mounting?" problem.
> 
> I really think extents and 48bit-ness should imply
> 	cp -a fs/ext3 fs/ext4
> and go from there.

The problem with this approach (as seen with ext2 and ext3) is that one
tree or the other gets stale w.r.t. bug fixes and now we have the case
where ext2 has a noticeably different implementation in some areas and
bug fixes are no longer trivial to apply to both trees.

I think all of the ext3 maintainers think this split was a bad idea in
hindsight, and having an ext3 mode where it can mount without a journal
would be much more desirable.

> IMHO the ext3 back-compat situation is already really hairy, with all 
> the features added since the original ext3 release.

While partially true, ext2/ext3 has a very good history w.r.t. compatibility
(with one exception being the EAs on symlinks problem that slipped through
with selinux).

Yes, the extents format will be incompatible with older ext3, but it isn't
enabled by default so it will be completely up to the sysadmin when they
make their filesystem incompatible.  Extents also won't impact any existing
files.  The earlier extents support gets into a kernel.org kernel, the
more systems will be able to mount a filesystem with the changes when
they become widely used.

All of the other features that are going to be introduced will only be
applicable at format time (filesystems larger than 16TB), or when
exceeding the limits of the current ext3 support (e.g. files larger than
2TB in size).
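
The size limits mentioned in this thread follow from simple powers of
two, which the sketch below works through. It is an illustration only:
the helper name is hypothetical, 1TB is taken as 2^40 bytes, and the
reading of the 8TB figure as 31 usable bits (block numbers held in
signed 32 bit ints) and of the 2TB file limit as a 32 bit count of
512-byte sectors are assumptions rather than statements from the
patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable with `bits`-wide unit numbers when each unit is
 * (1 << unit_bits) bytes. Hypothetical helper for the arithmetic only.
 */
static uint64_t max_addressable_bytes(unsigned bits, unsigned unit_bits)
{
	assert(bits + unit_bits < 64);	/* keep the result in 64 bits */
	return 1ULL << (bits + unit_bits);
}
```

With 4k blocks (unit_bits = 12), 31 bits give 2^43 bytes (8TB), 32 bits
give 2^44 bytes (16TB), and 48 bits give 2^60 bytes (1024 PB). A 32 bit
count of 512-byte sectors (unit_bits = 9) gives 2^41 bytes (2TB).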

> People (including me) still switch back and forth between ext2 and ext3 
> mounts of the same filesystem on occasion.  I think creating an "ext4" 
> would allow for greater developer flexibility in implementing new 
> features and ditching old ones -- while also emphasizing to the user 
> that switching back and forth between ext4 and ext[23] is a bad idea.

While this is partly true, one of the big benefits is that you can
transparently upgrade your system to use the new features and improve
performance without a long outage window.  Having a completely separate
ext4 filesystem doesn't improve the compatibility story at all.  There
has been renewed discussion on implementing "mounting ext3 without a
journal", just for a recovery mode, because ext2 will not be modified
to get all of these features (running e2fsck on a huge filesystem each
reboot would be insane).

> Overall, after applying extent (and 48bit) patches, I think it is wrong 
> to keep calling it ext3.  That will break some existing user 
> assumptions, and continue to restrict developers' freedom to implement 
> nifty new features.

Just FYI, all of the ext3 developers are on board with this patch series
and it has been discussed and reviewed for many weeks already, it isn't
just being pushed by one party.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
  2006-06-09  2:40 ` Valdis.Kletnieks
  2006-06-09  2:49 ` Jeff Garzik
@ 2006-06-09  9:13 ` Christoph Hellwig
  2006-06-09 10:07   ` Andrew Morton
                     ` (2 more replies)
  2006-06-30  0:16 ` [RFC][Update 0/16]extents and 48bit ext3/4 patches Mingming Cao
                   ` (16 subsequent siblings)
  19 siblings, 3 replies; 296+ messages in thread
From: Christoph Hellwig @ 2006-06-09  9:13 UTC (permalink / raw)
  To: Mingming Cao; +Cc: linux-kernel, ext2-devel, linux-fsdevel

On Thu, Jun 08, 2006 at 06:20:54PM -0700, Mingming Cao wrote:
> The current ext3 filesystem is limited to 8TB (4k block size). This is
> not enough in practice for the growing need for bigger storage as disks
> grow over the next few years (or even now).
> 
> To address this need, there is a joint effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
> expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is built on top of the extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:

What a horrible idea!  The nice things about ext3 are:

 - the rather simple and thus reliable implementation
 - the lack of incompatible ondisk changes

and the block numbers aren't the big problem concerning scalability, there's
a lot more to it, like btree(-like) structures in the allocator, parallel
allocator algorithms and a better allocation group concept.

If you guys want big storage on linux please help improve the filesystems
designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
both making ext3 less reliable for us desktop/small server users and not
getting the full thing for the big storage people either.


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  9:13 ` Christoph Hellwig
@ 2006-06-09 10:07   ` Andrew Morton
  2006-06-09 15:40     ` Jeff Garzik
  2006-06-09 10:49   ` Andreas Dilger
  2006-06-09 11:26   ` Alex Tomas
  2 siblings, 1 reply; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 10:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, ext2-devel, cmm, linux-kernel

On Fri, 9 Jun 2006 10:13:27 +0100
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jun 08, 2006 at 06:20:54PM -0700, Mingming Cao wrote:
> > The current ext3 filesystem is limited to 8TB (4k block size). This is
> > not enough in practice for the growing need for bigger storage as disks
> > grow over the next few years (or even now).
> > 
> > To address this need, there is a joint effort from RedHat, ClusterFS, IBM
> > and BULL to move ext3 from a 32 bit filesystem to a 48 bit filesystem,
> > expanding the ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> > ext3 is built on top of the extent map changes for ext3, originally from
> > Alex Tomas. In short, the new ext3 on-disk extents format is:
> 
> What a horrible idea!  The nice things about ext3 are:
> 
>  - the rather simple and thus reliable implementation

JBD isn't simple.  I don't think there's a need in this project to make
algorithmic changes in either JBD or htree, thankfully.

>  - the lack of incompatible ondisk changes

Ted&co have been pretty good at avoiding compatibility problems.

> and the block numbers aren't the big problem concerning scalability, there's
> a lot more to it, like btree(-like) structures in the allocator, parallel
> allocator algorithms and a better allocation group concept.

The performance testing results I've seen for a few of the components of
this project have been rather good, and that's the bottom line.

I don't know how the end result would compare in a bakeoff against XFS, and
I doubt if we know how much XFS performance would be improved if this
effort were diverted into that project.

But I don't think it's all as clear-cut as you imply.

> If you guys want big storage on linux please help improve the filesystems
> designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
> both making ext3 less reliable for us desktop/small server users and not
> getting the full thing for the big storage people either.

There have been pretty big changes in ext3 post-2.6.early and we've been OK
at avoiding breakage thus far.  It all comes down to how well the new
codepaths manage to avoid altering the existing ones.

That being said, ext3 isn't exactly ....  modern.  One day we'll need
something better.

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  9:13 ` Christoph Hellwig
  2006-06-09 10:07   ` Andrew Morton
@ 2006-06-09 10:49   ` Andreas Dilger
  2006-06-09 11:26   ` Alex Tomas
  2 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 10:49 UTC (permalink / raw)
  To: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
	linux-fsdevel

On Jun 09, 2006  10:13 +0100, Christoph Hellwig wrote:
> the block numbers aren't the big problem concerning scalability, there's
> a lot more to it, like btree(-like) structures in the allocator, parallel
> allocator algorithms and a better allocation group concept.

All of the allocator changes are already written and well tested, and gave
ext3 a 30% performance improvement while at the same time reducing CPU
usage by 50% - not trivial.  See Holger Kiehl's post
http://marc.theaimsgroup.com/?l=linux-kernel&m=114958967600822&w=4

> If you guys want big storage on linux please help improve the filesystems
> designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
> both making ext3 less reliable for us desktop/small server users and not
> getting the full thing for the big storage people either.

XFS = 108844 lines, and a complete mess to understand.
ext3+jbd = 27749 lines (includes ~6000 lines extent/allocation changes)

Also, ext3 is just much more robust in the face of on-disk corruption
than xfs or jfs because of its "static" layout, and e2fsck is way
better than the alternatives.  Despite their "big storage" designs ext3
is still competitive in performance, especially with the allocation
improvements.  See also:

http://marc.theaimsgroup.com/?l=ext2-devel&m=108194477207334&w=4
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=110112879929869&w=4
	http://samba.org/~tridge/xattr_results/xfs-ext3-tuning.png

If the XFS or JFS maintainers want to fix their filesystems, they are free
to do so; the ext3 maintainers (all of them, btw) want these changes.

The extent code (prior to some minor cleanups for landing on the vanilla
kernel) has already seen many millions of hours of testing in very heavy
IO environments, so it isn't something that was just written.  If extents
aren't enabled it amounts to a couple of extra conditionals in the
allocation path and basically no modifications to the existing code, so
you can safely avoid this code if you feel the need to.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  9:13 ` Christoph Hellwig
  2006-06-09 10:07   ` Andrew Morton
  2006-06-09 10:49   ` Andreas Dilger
@ 2006-06-09 11:26   ` Alex Tomas
  2006-06-09 14:23     ` [Ext2-devel] " Jeff Garzik
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 11:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, ext2-devel, Mingming Cao, linux-kernel

>>>>> Christoph Hellwig (CH) writes:

 CH> If you guys want big storage on linux please help improve the filesystems
 CH> designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
 CH> both making ext3 less reliable for us desktop/small server users and not
 CH> getting the full thing for the big storage people either.

proposed patches don't touch existing code paths.
extents may be enabled/disabled on per-file basis.

thanks, Alex

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 11:26   ` Alex Tomas
@ 2006-06-09 14:23     ` Jeff Garzik
  2006-06-09 14:33       ` Alex Tomas
  2006-06-09 14:34       ` Alex Tomas
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 14:23 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
	linux-fsdevel

Alex Tomas wrote:
>>>>>> Christoph Hellwig (CH) writes:
> 
>  CH> If you guys want big storage on linux please help improve the filesystems
>  CH> designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
>  CH> both making ext3 less reliable for us desktop/small server users and not
>  CH> getting the full thing for the big storage people either.
> 
> proposed patches don't touch existing code paths.
> extents may be enabled/disabled on per-file basis.

And thus, inodes are progressively incompatible with older kernels. 
Boot into an older kernel, and you can now only read half your 
filesystem (if it even allows mount at all).

	Jeff




* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 14:23     ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 14:33       ` Alex Tomas
  2006-06-09 14:34       ` Alex Tomas
  1 sibling, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 14:33 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
	ext2-devel, linux-fsdevel

>>>>> Jeff Garzik (JG) writes:

 JG> And thus, inodes are progressively incompatible with older
 JG> kernels. Boot into an older kernel, and you can now only read half
 JG> your filesystem (if it even allows mount at all).

nope, you aren't allowed to mount fs with extents-enabled files
by an ext3 which doesn't have the feature compiled in. the same will
happen if you call it ext4.
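
A mount refusal of this kind is conventionally done in ext2/ext3 with an
"incompat" feature bitmask in the superblock. The sketch below shows the
idea only; it assumes extents are advertised through such a bit, and the
macro names and bit values are illustrative, not copied from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions. */
#define INCOMPAT_FILETYPE	0x0002u
#define INCOMPAT_RECOVER	0x0004u
#define INCOMPAT_EXTENTS	0x0040u

/*
 * Returns true when the driver understands every incompat bit set on
 * disk. A single unknown bit is enough to refuse the mount.
 */
static bool can_mount(uint32_t disk_incompat, uint32_t driver_supported)
{
	return (disk_incompat & ~driver_supported) == 0;
}
```

In this sketch a driver that supports only INCOMPAT_FILETYPE and
INCOMPAT_RECOVER rejects a filesystem that also carries INCOMPAT_EXTENTS,
whatever name the driver is registered under.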

thanks, Alex

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 14:23     ` [Ext2-devel] " Jeff Garzik
  2006-06-09 14:33       ` Alex Tomas
@ 2006-06-09 14:34       ` Alex Tomas
  2006-06-09 14:35         ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 14:34 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
	linux-fsdevel, Alex Tomas

>>>>> Jeff Garzik (JG) writes:

 JG> And thus, inodes are progressively incompatible with older
 JG> kernels. Boot into an older kernel, and you can now only read half
 JG> your filesystem (if it even allows mount at all).

nope, you aren't allowed to mount fs with extents-enabled files
by an ext3 which doesn't have the feature compiled in. the same will
happen if you call it ext4.

thanks, Alex

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 14:34       ` Alex Tomas
@ 2006-06-09 14:35         ` Jeff Garzik
  2006-06-09 14:57           ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 14:35 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Christoph Hellwig, linux-fsdevel, ext2-devel, Mingming Cao,
	linux-kernel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> And thus, inodes are progressively incompatible with older
>  JG> kernels. Boot into an older kernel, and you can now only read half
>  JG> your filesystem (if it even allows mount at all).
> 
> nope, you aren't allowed to mount fs with extents-enabled files
> by an ext3 which doesn't have the feature compiled in. the same will
> happen if you call it ext4.

This is my point...  why increase user confusion by calling it ext3, then?

Extents magnify the "what ext3 filesystem am I talking to, today?" problem.

	Jeff

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 14:35         ` Jeff Garzik
@ 2006-06-09 14:57           ` Alex Tomas
  2006-06-09 15:17             ` [Ext2-devel] " Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 14:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
	linux-fsdevel, Alex Tomas

>>>>> Jeff Garzik (JG) writes:

 JG> Alex Tomas wrote:
 >>>>>>> Jeff Garzik (JG) writes:
 JG> And thus, inodes are progressively incompatible with older
 JG> kernels. Boot into an older kernel, and you can now only read half
 JG> your filesystem (if it even allows mount at all).
 >> nope, you aren't allowed to mount fs with extents-enabled files
 >> by an ext3 which doesn't have the feature compiled in. the same will
 >> happen if you call it ext4.

 JG> This is my point...  why increase user confusion by calling it ext3, then?

by default it's still good old ext3 without extents. the user has to
enable it explicitly. for him, this means the feature is ready
to be used anytime. the only thing he needs is to (re)mount the fs
with the option. for us, this means: a) a single source tree -
easy to maintain b) we must be clear with the user that the feature
isn't backward compatible

thanks, Alex

PS. in the end this is just ext3 with one more feature ...

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  8:35   ` Andreas Dilger
@ 2006-06-09 15:08     ` Jeff Garzik
  2006-06-09 15:25       ` Jeff Garzik
                         ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:08 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: cmm, Andrew Morton, Linus Torvalds, linux-kernel, ext2-devel,
	linux-fsdevel

Please fix your mailer to stop creating bogus Mail-Followup-To headers, 
headers which exclude the original poster, and cause compliant MUAs to 
incorrectly build To/CC.

Andreas Dilger wrote:
> On Jun 08, 2006  22:49 -0400, Jeff Garzik wrote:
>> One of my common complaints about massive ext3 updates such as this is 
>> the ever-growing "which ext3 filesystem am I mounting?" problem.
>>
>> I really think extents and 48bit-ness should imply
>> 	cp -a fs/ext3 fs/ext4
>> and go from there.
> 
> The problem with this approach (as seen with ext2 and ext3) is that one
> tree or the other gets stale w.r.t. bug fixes and now we have the case
> where ext2 has a noticeably different implementation in some areas and
> bug fixes are no longer trivial to apply to both trees.
> 
> I think all of the ext3 maintainers think this split was a bad idea in
> hindsight, and having an ext3 mode where it can mount without a journal
> would be much more desirable.

Please look beyond just ext2/3.  Other filesystems which have "version 
1", "version 2", "version 3", ... formats are all nasty as hell.  The 
end-result bloated code essentially supports several filesystems, all 
within the same code base, and it's a nightmare of ugliness.

Further, it's not only bloated, but slow.  The code inevitably winds up 
in one of two forms:

	if (spiffy new-feature metadata)
		...
	else if (updated metadata)
		...
	else /* original metadata */
		...

_or_ you add a level of indirection, by creating internal-to-the-fs 
pointer operations.
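
The two forms can be put side by side in a small userspace sketch. It
is an illustration under stated assumptions only: every name here
(inode_sketch, map_ops, map_block_branch, map_block_ops) is
hypothetical, and the lookups are placeholders rather than real block
mapping code.

```c
#include <assert.h>
#include <stdint.h>

enum map_format { MAP_INDIRECT, MAP_EXTENTS };	/* hypothetical formats */

struct inode_sketch;

/* Form 2 uses a per-inode table of operations. */
struct map_ops {
	uint64_t (*map_block)(const struct inode_sketch *ino, uint32_t lblk);
};

struct inode_sketch {
	enum map_format format;
	uint64_t base;			/* stand-in for mapping metadata */
	const struct map_ops *ops;
};

static uint64_t map_indirect(const struct inode_sketch *ino, uint32_t lblk)
{
	return ino->base + lblk;	/* placeholder lookup */
}

static uint64_t map_extents(const struct inode_sketch *ino, uint32_t lblk)
{
	return ino->base + 2ULL * lblk;	/* placeholder lookup */
}

/* Form 1: branch on the metadata format at every call site. */
static uint64_t map_block_branch(const struct inode_sketch *ino, uint32_t lblk)
{
	if (ino->format == MAP_EXTENTS)
		return map_extents(ino, lblk);
	else /* original metadata */
		return map_indirect(ino, lblk);
}

static const struct map_ops indirect_ops = { .map_block = map_indirect };
static const struct map_ops extents_ops = { .map_block = map_extents };

/* Form 2: one indirect call through the table chosen at inode setup. */
static uint64_t map_block_ops(const struct inode_sketch *ino, uint32_t lblk)
{
	assert(ino->ops && ino->ops->map_block);
	return ino->ops->map_block(ino, lblk);
}
```

Both forms return the same block for the same inode; the difference is
whether the cost is a chain of conditionals that grows with each new
format or an extra pointer dereference on every call.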

Stuffing more and more features into fs/ext3 means you are following the 
path that leads to reiser4...  where EVERYTHING under the hood is 
mutable, all within fs/ext3.

>> IMHO the ext3 back-compat situation is already really hairy, with all 
>> the features added since the original ext3 release.
> 
> While partially true, ext2/ext3 has a very good history w.r.t. compatibility
> (with one exception being the EAs on symlinks problem that slipped through
> with selinux).
> 
> Yes, the extents format will be incompatible with older ext3, but it isn't
> enabled by default so it will be completely up to the sysadmin when they
> make their filesystem incompatible.  Extents also won't impact any existing
> files.  The earlier extents support gets into a kernel.org kernel, the
> more systems will be able to mount a filesystem with the changes when
> they become widely used.
> 
> All of the other features that are going to be introduced will only be
> applicable at format time (filesystems larger than 16TB), or when
> exceeding the limits of the current ext3 support (e.g. files larger than
> 2TB in size).

Yet more progressive incompatibility, yet more

	if (metadata v2)
		...
	else /* metadata v1 */
		...

Why do you insist upon calling the end result ext3, when the truth is 
that you are slowly rewriting ext3?

As time progresses, more and more admins must ask themselves the 
question "what flavor of ext3 filesystem is on my hard drive?"

Here's a key question for ext3 developers, which I bet has no answer: 
when is it enough?  Is the plan to continually introduce incompatible 
features into ext3, over time, ad infinitum?

>> People (including me) still switch back and forth between ext2 and ext3 
>> mounts of the same filesystem on occasion.  I think creating an "ext4" 
>> would allow for greater developer flexibility in implementing new 
>> features and ditching old ones -- while also emphasizing to the user 
>> that switching back and forth between ext4 and ext[23] is a bad idea.
> 
> While this is partly true, one of the big benefits is that you can
> transparently upgrade your system to use the new features and improve
> performance without a long outage window.  Having a completely separate

Changing the name to ext4 doesn't erase this capability.

> ext4 filesystem doesn't improve the compatibility story at all.  There
> has been renewed discussion on implementing "mounting ext3 without a
> journal", just for a recovery mode, because ext2 will not be modified
> to get all of these features (running e2fsck on a huge filesystem each
> reboot would be insane).

So now you are going backwards, and implementing ext2-within-ext3?

Are you ready to admit, yet, that ext3 is 100% mutable in the minds of 
ext3 developers?  Why not implement the minix filesystem format within 
ext3, at this point?  We could call it a "plugin", I bet.

>> Overall, after applying extent (and 48bit) patches, I think it is wrong 
>> to keep calling it ext3.  That will break some existing user 
>> assumptions, and continue to restrict developers' freedom to implement 
>> nifty new features.
> 
> Just FYI, all of the ext3 developers are on board with this patch series
> and it has been discussed and reviewed for many weeks already, it isn't
> just being pushed by one party.

That is completely irrelevant to this thread.

If all the ext3 developers are on board, that just implies that there is 
no clear definition of what "ext3" really means.  With this patch 
series, and with future plans described here and elsewhere, the name 
"ext3" will become more and more meaningless.  It could mean _any_ of 
several filesystem metadata variants, and the admin will have no clue 
which variant they are talking to until they try to mount the blkdev 
(and possibly fail the mount).

At SOME point, clueful developers will say "we should better concentrate 
our energy on a new filesystem."

But I see no one at all defining that "some point."

At some point you are beating a dead horse.  At some point, you are 
pushing features into a filesystem that was never designed to support 
said features.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 14:57           ` Alex Tomas
@ 2006-06-09 15:17             ` Jeff Garzik
  2006-06-09 16:21               ` Mike Snitzer
  2006-06-09 16:56               ` Andreas Dilger
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:17 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
	linux-fsdevel

Alex Tomas wrote:
> PS. in the end this is just ext3 with one more feature ...

Incorrect.  You have to look at ext3 development over time.  This is a 
PATTERN with ext3 development:  mutating the metadata over time in a 
progressively incompatible manner.

You have this thing called "ext3", which fools an admin into thinking 
they can use their filesystem with any kernel that has "ext3" support. 
That's somewhat true today, but with extents it will become false. 
Having a mutating definition of "ext3" is a convenience for developers, 
and for users WHO ONLY MOVE FORWARD in kernel versions.

A 48bit ext3 filesystem with extents is completely unusable in 2.4.30's 
"ext3" or 2.6.10's "ext3".  Users are forced to hunt down the specific 
kernel version when an incompatible feature was added to ext3.  How can 
that possibly be described as "user friendly"?

"Which ext3 am I talking to, today?"
"And which kernels am I locked into, in order to talk to my filesystem?"

Not all users are big production houses that plan their filesystem 
metadata migration months in advance!  I _guarantee_ some users will 
boot into ext3-with-extents, use it for a while, and then try to 
downgrade for whatever reason...  only to find they have been LOCKED 
OUT.  That is a very real world situation, guys.

	Jeff
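
To make the lock-out concrete, here is a minimal sketch of the kind of
incompat-feature check ext2/ext3 perform at mount time: a kernel refuses
to mount a filesystem whose superblock carries incompatible feature bits
it does not know.  The macro names, the bit values, and the helper
unsupported_incompat() are assumptions chosen for illustration; they are
not the kernel's actual definitions.

```c
#include <stdint.h>

/* Hypothetical incompat feature bits; the values are illustrative only. */
#define FEAT_INCOMPAT_FILETYPE 0x0002u
#define FEAT_INCOMPAT_RECOVER  0x0004u
#define FEAT_INCOMPAT_EXTENTS  0x0040u

/*
 * Return the incompat bits set in the superblock that this kernel does
 * not support.  A result of 0 means the filesystem may be mounted; any
 * other value means the mount has to be refused.
 */
static uint32_t unsupported_incompat(uint32_t sb_incompat,
                                     uint32_t kernel_supported)
{
        return sb_incompat & ~kernel_supported;
}
```

With these illustrative values, a superblock that has the extents bit
set yields a nonzero result against an older kernel's mask, so that
kernel would refuse the mount, while a kernel whose mask includes the
bit gets 0.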

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  2:40 ` Valdis.Kletnieks
  2006-06-09  8:20   ` Andreas Dilger
@ 2006-06-09 15:23   ` Mingming Cao
  1 sibling, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-09 15:23 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-fsdevel, ext2-devel, linux-kernel

Valdis.Kletnieks@vt.edu wrote:
> On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
> 
>>Current ext3 filesystem is limited to 8TB(4k block size), this is
>>practically not enough for the increasing need of bigger storage as
>>disks in a few years (or even now).
>>
>>To address this need, there are co-effort from RedHat, ClusterFS, IBM
>>and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
>>expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
>>ext3 is build on top of extent map changes for ext3, originally from
>>Alex Tomas. In short, the new ext3 on-disk extents format is:
> 
> 
> which implies matching changes to mkfs.ext2 and possibly mount..
> 
> 
Alexandre Ratchov and Laurent Vivier from BULL have done some work 
in e2fsprogs to support extents and 48/64 bit ext3, although the patches 
have not been thoroughly reviewed and discussed yet...

http://marc.theaimsgroup.com/?l=ext2-devel&m=114848122624510&w=2

>>Appreciate any comments and feedbacks!
> 
> 
> Somebody else was recently discussing a set of patches to ext3 for
> extents+delalloc+mballoc patches - is this work compatible with that?
> 
Yes, the extents patch you mentioned is the same one included in this 
series. The delalloc (delayed allocation support for ext3) and mballoc 
(multiple block allocation based on extents) patches are considered 
future additions, as this series is intended to address only the 
capacity limit and the on-disk format.

> Also, a pointer to the matching userspace patches would help anybody
> who's gung-ho enough to test the code....
>

Thanks for your interest!

We have tested patches 1-4 (which basically do not touch any on-disk 
format) and they have been in the mm tree. The extent patch itself has 
been tested for a long time by ClusterFS and IBM, as it was actually 
posted a while back.

At this point the whole series compiles, but has not been tested yet. 
This post, as an RFC, is intended to collect comments and feedback. The 
BULL team has done some testing on the 2.6.16 version of the series with 
the e2fsprogs changes they posted, though. I will upload the matching 
e2fsprogs changes to ext2.sf.net/48bitsext3 shortly.

Mingming

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:08     ` Jeff Garzik
@ 2006-06-09 15:25       ` Jeff Garzik
  2006-06-09 15:40         ` Linus Torvalds
  2006-06-10 19:10         ` Kyle Moffett
  2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
  2006-06-09 20:32       ` Stephen C. Tweedie
  2 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:25 UTC (permalink / raw)
  To: linux-kernel, ext2-devel, linux-fsdevel
  Cc: Andrew Morton, Linus Torvalds, cmm, Andreas Dilger

Overall, I'm surprised that ext3 developers don't see any of the 
problems related to progressive, stealth filesystem upgrades.

Users are never given a clear indication of when their metadata is being 
upgraded, there is no clear "line of demarcation" they cross, when they 
start using extents.

Since there is no user-visible fs upgrade event, users do not have a 
clear picture of what features are being used -- which means they are 
kept in the dark about which kernels are OK to use on their data.

Do you guys honestly expect users to keep track of which kernels added 
specific ext3 features?

This is why other enterprise filesystems have clear "fs version 1", "fs 
version 2" points across which a user migrates.  ext3's feature-flags 
approach just means that there are a million combinations of potential 
old-and-new features, in-tree and third party, all of which must be 
supported.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:08     ` Jeff Garzik
  2006-06-09 15:25       ` Jeff Garzik
@ 2006-06-09 15:28       ` Alex Tomas
  2006-06-09 15:31         ` Matthew Wilcox
                           ` (2 more replies)
  2006-06-09 20:32       ` Stephen C. Tweedie
  2 siblings, 3 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 15:28 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel

 JG> "ext3" will become more and more meaningless.  It could mean _any_ of 
 JG> several filesystem metadata variants, and the admin will have no clue 
 JG> which variant they are talking to until they try to mount the blkdev 
 JG> (and possibly fail the mount).

debugfs <dev> -R stats | grep features ?


thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 15:31         ` Matthew Wilcox
  2006-06-10  3:26           ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
  2006-06-09 15:44         ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
  2006-06-09 15:53         ` [Ext2-devel] " Gerrit Huizenga
  2 siblings, 1 reply; 296+ messages in thread
From: Matthew Wilcox @ 2006-06-09 15:31 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 07:28:22PM +0400, Alex Tomas wrote:
>  JG> "ext3" will become more and more meaningless.  It could mean _any_ of 
>  JG> several filesystem metadata variants, and the admin will have no clue 
>  JG> which variant they are talking to until they try to mount the blkdev 
>  JG> (and possibly fail the mount).
> 
> debugfs <dev> -R stats | grep features ?

... a simple and intuitive command which just trips off the tongue.

I want extents, but I'm still unconvinced that ext3 needs to grow beyond
32-bit blocks.  The scheme posted by Val and Arjan (with the
continuation inodes) seems much neater.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 10:07   ` Andrew Morton
@ 2006-06-09 15:40     ` Jeff Garzik
  2006-06-09 15:42       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, cmm, linux-kernel, ext2-devel, linux-fsdevel

Andrew Morton wrote:
> Ted&co have been pretty good at avoiding compatibility problems.

Well, extents and 48bit make that track record demonstrably worse.

Users are now forced to remember that, if they write to their filesystem 
after using either $mmver or $korgver kernels, they are locked out of 
using older kernels.

From the user's perspective, ext3 has no clear "metadata version 1", 
"metadata version 2" division.  Thus they are now forced to keep a 
matrix of kernel versions and ext3 feature flag support, to know which 
kernels are usable with which data.  It is a support nightmare.

At no point is a user ever told, in big capital letters, "IF YOU WRITE 
TO THIS FILESYSTEM, YOU CAN'T BOOT OLDER KERNELS."  There is no "click 
OK to continue with this dramatic event."

And as features continue to be added in this manner, this problem gets 
_exponentially_ worse.

On the project management side of things, I see no indication that this 
momentum will slow -- which implies to me that people will keep slapping 
new stuff into ext3, rather than directing energy towards a newer, 
cleaner ext-NG filesystem.

Dragging around back-compat really constrains freedom, and you have to 
have some sort of "pressure relief valve" (a massive, wildly 
incompatible update) eventually.

In my mind, it's analogous to locking developers into developing and 
deploying new features into a stable branch of software.  The hacks just 
get worse and worse, as you bend over backwards for back-compat.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:25       ` Jeff Garzik
@ 2006-06-09 15:40         ` Linus Torvalds
  2006-06-09 15:47           ` Jeff Garzik
                             ` (2 more replies)
  2006-06-10 19:10         ` Kyle Moffett
  1 sibling, 3 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 15:40 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: linux-kernel, ext2-devel, linux-fsdevel, Andreas Dilger, cmm,
	Andrew Morton

On Fri, 9 Jun 2006, Jeff Garzik wrote:
>
> Overall, I'm surprised that ext3 developers don't see any of the problems
> related to progressive, stealth filesystem upgrades.

Hey, they're used to it - they've been doing it for a long time.

In fact, ext3 wouldn't be ext3 unless I (and perhaps a few others) had 
insisted on it. People wanted to try to upgrade ext2 in place.

And they've been upgrading it in-place for a long time.

Now, there are unquestionably advantages to that approach too, but as you 
say, there are absolutely tons of disadvantages too. Bugs get much much 
subtler, and more disastrous for old users that don't even want the new 
features.

Quite frankly, at this point, there's no way in hell I believe we can do 
major surgery on ext3. It's the main filesystem for a lot of users, and 
it's just not worth the instability worries unless it's something very 
obviously transparent.

I wouldn't mind an ext4 (that hopefully drops some of the features of 
ext3, and might not downgrade to ext2 on errors, for example).

			Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40     ` Jeff Garzik
@ 2006-06-09 15:42       ` Matthew Wilcox
  2006-06-09 15:51         ` Jeff Garzik
  2006-06-09 17:29         ` Alan Cox
  2006-06-09 16:56       ` Andrew Morton
  2006-06-09 18:23       ` Michael Poole
  2 siblings, 2 replies; 296+ messages in thread
From: Matthew Wilcox @ 2006-06-09 15:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Christoph Hellwig, cmm,
	linux-fsdevel

On Fri, Jun 09, 2006 at 11:40:03AM -0400, Jeff Garzik wrote:
> Users are now forced to remember that, if they write to their filesystem 
> after using either $mmver or $korgver kernels, they are locked out of 
> using older kernels.
> 
> From the user's perspective, ext3 has no clear "metadata version 1", 
> "metadata version 2" division.  Thus they are now forced to keep a 
> matrix of kernel versions and ext3 feature flag support, to know which 
> kernels are usable with which data.  It is a support nightmare.

Hang on, you're going too far.  You have to enable extents with the
extent mount option.  Otherwise you don't get to use them.  The user
does, in fact, have a clear division, although maybe the blinky signs
aren't quite luminous enough.

I still think making ext3 bigger than 16TB is just silly.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
  2006-06-09 15:31         ` Matthew Wilcox
@ 2006-06-09 15:44         ` Jeff Garzik
  2006-06-09 15:53           ` Alex Tomas
  2006-06-09 18:29           ` Andreas Dilger
  2006-06-09 15:53         ` [Ext2-devel] " Gerrit Huizenga
  2 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:44 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel

Alex Tomas wrote:
>  JG> "ext3" will become more and more meaningless.  It could mean _any_ of 
>  JG> several filesystem metadata variants, and the admin will have no clue 
>  JG> which variant they are talking to until they try to mount the blkdev 
>  JG> (and possibly fail the mount).
> 
> debugfs <dev> -R stats | grep features ?

The question is, do you

a) expect users to run this magic command, and DTRT or

b) watch users boot w/ extents, accidentally do something silly like 
writing data to a file, and become locked into a new subset of kernels?

The simple act of writing data to a file has become an _irrevocable 
filesystem upgrade event_.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40         ` Linus Torvalds
@ 2006-06-09 15:47           ` Jeff Garzik
  2006-06-09 15:55             ` Alex Tomas
                               ` (2 more replies)
  2006-06-09 15:57           ` Jeff Garzik
  2006-06-09 16:10           ` [Ext2-devel] " Alex Tomas
  2 siblings, 3 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, ext2-devel, linux-fsdevel, Andreas Dilger, cmm,
	Andrew Morton

Linus Torvalds wrote:
> 
> On Fri, 9 Jun 2006, Jeff Garzik wrote:
>> Overall, I'm surprised that ext3 developers don't see any of the problems
>> related to progressive, stealth filesystem upgrades.
> 
> Hey, they're used to it - they've been doing it for a long time.

Agreed, but my argument is that extents are a Big Deal.

Think about The Experience:  Suddenly users that could use 2.4.x and 
2.6.x are locked into 2.6.18+, by the simple and common act of writing 
to a file.

No bells and whistles go off...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:42       ` Matthew Wilcox
@ 2006-06-09 15:51         ` Jeff Garzik
  2006-06-09 17:29         ` Alan Cox
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
	linux-fsdevel

Matthew Wilcox wrote:
> On Fri, Jun 09, 2006 at 11:40:03AM -0400, Jeff Garzik wrote:
>> Users are now forced to remember that, if they write to their filesystem 
>> after using either $mmver or $korgver kernels, they are locked out of 
>> using older kernels.
>>
>> From the user's perspective, ext3 has no clear "metadata version 1", 
>> "metadata version 2" division.  Thus they are now forced to keep a 
>> matrix of kernel versions and ext3 feature flag support, to know which 
>> kernels are usable with which data.  It is a support nightmare.
> 
> Hang on, you're going too far.  You have to enable extents with the
> extent mount option.  Otherwise you don't get to use them.  The user
> does, in fact, have a clear division, although maybe the blinky signs
> aren't quite luminous enough.

...and how are distros going to deploy this?  They are going to turn on 
extents by default.

And do we honestly think that is a scalable option _anyway_?  That will 
slowly bloat fstab and mount command lines with an ever-increasing list 
of options.

It's IMO a better experience for the user, and gives the developers more 
freedom.  Look, I _really_ want extents.  I am a big fan.  But I think 
that extents are a good time to make a clean break, and let ext3 live as 
it is.  And it will let ext3 stabilize.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:53           ` Alex Tomas
@ 2006-06-09 15:52             ` Jeff Garzik
  2006-06-09 16:02               ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:52 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Alex Tomas wrote:
>  >>  JG> "ext3" will become more and more meaningless.  It could mean _any_ of
>  >>  JG> several filesystem metadata variants, and the admin will have no clue
>  >>  JG> which variant they are talking to until they try to mount the blkdev
>  >>  JG> (and possibly fail the mount).
>  >> debugfs <dev> -R stats | grep features ?
> 
>  JG> The question is, do you
> 
>  JG> a) expect users to run this magic command, and DTRT or
> 
>  JG> b) watch users boot w/ extents, accidentally do something silly like
>  JG> writing data to a file, and become locked into a new subset of kernels?
> 
> at the moment there is no way to "boot w/ extents". you must enable
> them by mount option.

Think about how distros will deploy this feature.  Also, think about how 
scalable that line of thinking is...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:44         ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
@ 2006-06-09 15:53           ` Alex Tomas
  2006-06-09 15:52             ` Jeff Garzik
  2006-06-09 18:29           ` Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 15:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> Alex Tomas wrote:
 >>  JG> "ext3" will become more and more meaningless.  It could mean _any_ of
 >>  JG> several filesystem metadata variants, and the admin will have no clue
 >>  JG> which variant they are talking to until they try to mount the blkdev
 >>  JG> (and possibly fail the mount).
 >> debugfs <dev> -R stats | grep features ?

 JG> The question is, do you

 JG> a) expect users to run this magic command, and DTRT or

 JG> b) watch users boot w/ extents, accidentally do something silly like
 JG> writing data to a file, and become locked into a new subset of kernels?

at the moment there is no way to "boot w/ extents". you must enable
them by mount option.


thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
  2006-06-09 15:31         ` Matthew Wilcox
  2006-06-09 15:44         ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
@ 2006-06-09 15:53         ` Gerrit Huizenga
  2006-06-09 16:03           ` Jeff Garzik
  2006-06-09 16:09           ` Linus Torvalds
  2 siblings, 2 replies; 296+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 15:53 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, 09 Jun 2006 19:28:22 +0400, Alex Tomas wrote:
>  JG> "ext3" will become more and more meaningless.  It could mean _any_ of 
>  JG> several filesystem metadata variants, and the admin will have no clue 
>  JG> which variant they are talking to until they try to mount the blkdev 
>  JG> (and possibly fail the mount).
> 
> debugfs <dev> -R stats | grep features ?

Sounds similar to cat /proc/cpuinfo.  How *do* we deal with processors
which have all these many different features?  Probably better than we
would if each variant were viewed as a different architecture.

Jeff's approach taken to the ridiculous would mean that we'd have
ext versions 1-40 by now at least.  I don't think that helps much,
either.

I think the ext2/3 team has done a great job of providing compatibility.
It isn't perfect compatibility forwards *and* backwards, but moving
forwards always seems to be pretty reasonable.

gerrit

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:47           ` Jeff Garzik
@ 2006-06-09 15:55             ` Alex Tomas
  2006-06-09 15:56               ` Jeff Garzik
  2006-06-09 16:01             ` Linus Torvalds
  2006-06-09 20:38             ` Stephen C. Tweedie
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 15:55 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> think about The Experience:  Suddenly users that could use 2.4.x and 
 JG> 2.6.x are locked into 2.6.18+, by the simple and common act of writing 
 JG> to a file.

sorry to repeat, but if they simply try 2.6.18, they won't get extents.
instead, they must specify the extents mount option. and at this point
it must be clear to them that this is a way to get an incompatible fs.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:55             ` Alex Tomas
@ 2006-06-09 15:56               ` Jeff Garzik
  2006-06-09 16:07                 ` Alex Tomas
  2006-06-09 20:52                 ` Stephen C. Tweedie
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:56 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> think about The Experience:  Suddenly users that could use 2.4.x and 
>  JG> 2.6.x are locked into 2.6.18+, by the simple and common act of writing 
>  JG> to a file.
> 
> sorry to repeat, but if they simply try 2.6.18, they won't get extents.
> instead, they must specify the extents mount option. and at this point
> it must be clear to them that this is a way to get an incompatible fs.

Think about how this will be deployed in production, long term.

If extents are not made default at some point, then no one will use the 
feature, and it should not be merged.

And when extents are default, you have this blizzard-of-feature-flags 
stealth upgrade event occur _sometime_ after they boot into the new fs 
for the first time.  And then when they want to boot another kernel, 
they have to dig down a feature matrix, and figure out which ext3 
codebase will work for them.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40         ` Linus Torvalds
  2006-06-09 15:47           ` Jeff Garzik
@ 2006-06-09 15:57           ` Jeff Garzik
  2006-06-09 16:10           ` [Ext2-devel] " Alex Tomas
  2 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
	Andreas Dilger

Linus Torvalds wrote:
> Quite frankly, at this point, there's no way in hell I believe we can do 
> major surgery on ext3. It's the main filesystem for a lot of users, and 
> it's just not worth the instability worries unless it's something very 
> obviously transparent.
> 
> I wouldn't mind an ext4 (that hopefully drops some of the features of 
> ext3, and might not downgrade to ext2 on errors, for example).

Certainly agreed, for all of this :)

I think that the lack of ext4 means people keep trying to stuff the 
wrong things into ext3.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:47           ` Jeff Garzik
  2006-06-09 15:55             ` Alex Tomas
@ 2006-06-09 16:01             ` Linus Torvalds
  2006-06-09 20:38             ` Stephen C. Tweedie
  2 siblings, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
	Andreas Dilger

On Fri, 9 Jun 2006, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > 
> > On Fri, 9 Jun 2006, Jeff Garzik wrote:
> > > Overall, I'm surprised that ext3 developers don't see any of the problems
> > > related to progressive, stealth filesystem upgrades.
> > 
> > Hey, they're used to it - they've been doing it for a long time.
> 
> Agreed, but my argument is that extents are a Big Deal.

I'm not arguing against you - I'm arguing with you.

I just tried to explain what you saw as "surprising" - the fact that ext3 
developers don't see this as a problem at all. They don't see it as a 
problem, because it's how they have always worked, since before ext3 was 
ext3, and it was just a crazy extension to ext2.

And yes, it's a serious problem. Ext3 is pretty damn messy. It's not as 
messy as some, but it sure has potential.

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:52             ` Jeff Garzik
@ 2006-06-09 16:02               ` Alex Tomas
  2006-06-09 16:04                 ` [Ext2-devel] " Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> Alex Tomas wrote:
 >> at the moment there is no way to "boot w/ extents". you must enable
 >> them by mount option.

 JG> Think about how distros will deploy this feature.  Also, think about
 JG> how scalable that line of thinking is...

I may be wrong, but I tend to think if they're stupid enough to enable
an experimental mount option by default, they can do s/ext3/ext4 as well.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:53         ` [Ext2-devel] " Gerrit Huizenga
@ 2006-06-09 16:03           ` Jeff Garzik
  2006-06-09 16:09           ` Linus Torvalds
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:03 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

Gerrit Huizenga wrote:
> Jeff's approach taken to the ridiculous would mean that we'd have
> ext versions 1-40 by now at least.  I don't think that helps much,
> either.

That's plainly silly.  Like everything else in life, it is a balance of 
costs.

At some point, ext3's fs-feature-flag approach increases the 
combinations of metadata variants you must support exponentially.

Moving to extents and 48bit (which I want) is a big enough step that, 
IMO, some of the support costs become far more obvious.

	Jeff
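
As a rough illustration of the growth described above, the sketch below
counts the on/off combinations of n feature flags.  It assumes the flags
are fully independent, which overstates the real matrix (some flags
depend on others), so treat the result as an upper bound.  The function
name flag_combinations() is hypothetical.

```c
#include <stdint.h>

/*
 * Upper bound on the number of distinct metadata variants when each of
 * n_flags feature flags can independently be on or off: 2^n_flags.
 * Saturates at UINT64_MAX when the result does not fit in 64 bits.
 */
static uint64_t flag_combinations(unsigned int n_flags)
{
        if (n_flags >= 64)
                return UINT64_MAX;
        return (uint64_t)1 << n_flags;
}
```

Under that assumption, 10 flags give 1024 variants and 20 flags give
1048576, so each added flag doubles the number of combinations that
could in principle need testing and support.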

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:02               ` Alex Tomas
@ 2006-06-09 16:04                 ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:04 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Alex Tomas wrote:
>  >> at the moment there is no way to "boot w/ extents". you must enable
>  >> them by mount option.
> 
>  JG> Think about how distros will deploy this feature.  Also, think about
>  JG> how scalable that line of thinking is...
> 
> I may be wrong, but I tend to think if they're stupid enough to enable
> an experimental mount option by default, they can do s/ext3/ext4 as well.

<sigh>  At some point in the future, it will not be experimental.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:56               ` Jeff Garzik
@ 2006-06-09 16:07                 ` Alex Tomas
  2006-06-09 16:09                   ` [Ext2-devel] " Jeff Garzik
  2006-06-09 18:04                   ` Matthew Frost
  2006-06-09 20:52                 ` Stephen C. Tweedie
  1 sibling, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:07 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> Think about how this will be deployed in production, long term.

 JG> If extents are not made default at some point, then no one will use
 JG> the feature, and it should not be merged.

sorry, I disagree. for example, NUMA isn't default and shouldn't be.
but we have it in the tree and any one may choose to use it. the same
with extents. let's have it in. but let's make clear it's experimental,
it makes sense for large files only, it isn't backward compatible and
so on.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:53         ` [Ext2-devel] " Gerrit Huizenga
  2006-06-09 16:03           ` Jeff Garzik
@ 2006-06-09 16:09           ` Linus Torvalds
  2006-06-09 17:58             ` Gerrit Huizenga
  1 sibling, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:09 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
	cmm, linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> 
> Jeff's approach taken to the ridiculous would mean that we'd have
> ext versions 1-40 by now at least.  I don't think that helps much,
> either.

On the other hand, I _guarantee_ you that it helps that we have ext2-3, 
and not just ext2 (nobody even tried to keep ext1 compatible, thank the 
Gods).

If for no other reason than the fact that the ext3 development could be 
much more aggressive early on. Exactly because it did NOT screw up the old 
filesystem that everybody else depended on.

So we have empirical evidence that splitting filesystem work up does 
actually help. 

			Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:07                 ` Alex Tomas
@ 2006-06-09 16:09                   ` Jeff Garzik
  2006-06-09 18:04                   ` Matthew Frost
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:09 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Think about how this will be deployed in production, long term.
> 
>  JG> If extents are not made default at some point, then no one will use
>  JG> the feature, and it should not be merged.
> 
> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> but we have it in the tree and any one may choose to use it. the same
> with extents. let's have it in. but let's make clear it's experimental,
> it makes sense for large files only, it isn't backward compatible and
> so on.

NUMA _is_ on by default, in newer hardware kernels :)  K8 is NUMA by 
default, remember.

But anyway...  the "it's experimental" argument is _completely_ 
irrelevant.  You have to think about the day when it is not, and how 
that will get deployed, and what are the potential problems that will 
arise from deployment.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40         ` Linus Torvalds
  2006-06-09 15:47           ` Jeff Garzik
  2006-06-09 15:57           ` Jeff Garzik
@ 2006-06-09 16:10           ` Alex Tomas
  2006-06-09 16:10             ` Jeff Garzik
  2006-06-09 16:25             ` Linus Torvalds
  2 siblings, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

>>>>> Linus Torvalds (LT) writes:

 LT> Quite frankly, at this point, there's no way in hell I believe we can do 
 LT> major surgery on ext3. It's the main filesystem for a lot of users, and 
 LT> it's just not worth the instability worries unless it's something very 
 LT> obviously transparent.

I believe it's as stable as before until you mount with extents
mount option.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:10           ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 16:10             ` Jeff Garzik
  2006-06-09 16:24               ` Erik Mouw
                                 ` (2 more replies)
  2006-06-09 16:25             ` Linus Torvalds
  1 sibling, 3 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:10 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Linus Torvalds (LT) writes:
> 
> 
>  LT> Quite frankly, at this point, there's no way in hell I believe we can do 
>  LT> major surgery on ext3. It's the main filesystem for a lot of users, and 
>  LT> it's just not worth the instability worries unless it's something very 
>  LT> obviously transparent.
> 
> I believe it's as stable as before until you mount with extents
> mount option.

If it will remain a mount option, if it is never made the default 
(either in kernel or distro level), then only 1% of users will ever use 
the feature.  And we shouldn't merge a 1% use feature into the _main_ 
filesystem for Linux.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:17             ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 16:21               ` Mike Snitzer
  2006-06-09 16:27                 ` Jeff Garzik
  2006-06-09 16:33                 ` Alex Tomas
  2006-06-09 16:56               ` Andreas Dilger
  1 sibling, 2 replies; 296+ messages in thread
From: Mike Snitzer @ 2006-06-09 16:21 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
	ext2-devel, linux-fsdevel

On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> Alex Tomas wrote:
> > PS. in the end this is just ext3 with one more feature ...
>
> Incorrect.  You have to look at ext3 development over time.  This is a
> PATTERN with ext3 development:  mutating the metadata over time in a
> progressively incompatible manner.
>
> You have this thing called "ext3", which fools an admin into thinking
> they can use their filesystem with any kernel that has "ext3" support.
> That's somewhat true today, but with extents it will become false.
> Having a mutating definition of "ext3" is a convenience for developers,
> and for users WHO ONLY MOVE FORWARD in kernel versions.
>
> A 48bit ext3 filesystem with extents is completely unusable in 2.4.30's
> "ext3" or 2.6.10's "ext3".  Users are forced to hunt down the specific
> kernel version when an incompatible feature was added to ext3.  How can
> that possibly be described as "user friendly"?
>
> "Which ext3 am I talking to, today?"
> "And which kernels am I locked into, in order to talk to my filesystem?"
>
> Not all users are big production houses that plan their filesystem
> metadata migration months in advance!  I _guarantee_ some users will
> boot into ext3-with-extents, use it for a while, and then try to
> downgrade for whatever reason...  only to find they have been LOCKED
> OUT.  That is a very real world situation, guys.

Jeff,

I think all of us do understand what you're saying and on some level
are willing to accept that ext3-with-extents is in fact worthy of
branching to ext4, hence the url that has hosted the development of
extents (mballoc, delalloc, 48bit etc):
http://www.bullopensource.org/ext4/

But it _seems_ you're trying to paint ALL the ext3-developers as a
narrow minded lot.  If and when users decide to enable ext3 extents on
their filesystems they will presumably understand that doing so
precludes their ability to boot older kernels (steps can be taken to
make them well aware of this). The "real world situation" you refer
to, while hypothetically valid, isn't something informed
ext3-with-extents users will _ever_ elect to do.

Once a compelling feature is introduced Linux users embrace it and
never look back (provided it is stable!).  The real risk is the
(in)stability of all these ext3 improvements.  Stability is obviously
a requirement for merging these changes but I for one find it
refreshing that the current desire is to merge extents with ext3
(implicitly speaks to its stability when you couple that desire with
the fact that so many ext3 stakeholders are onboard!).

And as an aside, merging extents with ext3 forces ext3-developers to
be somewhat conservative about what bells and whistles they'd be
introducing moving forward.  The worst thing would be for these ext3
improvements to get merged into a new ext4 that becomes widely known
as "the experimental ext3++".  I suppose developer discipline would
prevent such an unfortunate distinction but a new ext4 sandbox _could_
open the flood gates.

Developers never _want_ to branch (maintenance-hell), the question
becomes: do the risks associated with ext3-with-extents' backward
incompatibility _really_ justify the branch?

Mike

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:10             ` Jeff Garzik
@ 2006-06-09 16:24               ` Erik Mouw
  2006-06-09 16:28                 ` Jeff Garzik
  2006-06-09 16:24               ` [Ext2-devel] " Chase Venters
  2006-06-09 16:25               ` Alex Tomas
  2 siblings, 1 reply; 296+ messages in thread
From: Erik Mouw @ 2006-06-09 16:24 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 12:10:59PM -0400, Jeff Garzik wrote:
> Alex Tomas wrote:
> > I believe it's as stable as before until you mount with extents
> > mount option.
> 
> If it will remain a mount option, if it is never made the default 
> (either in kernel or distro level), then only 1% of users will ever use 
> the feature.  And we shouldn't merge a 1% use feature into the _main_ 
> filesystem for Linux.

Why not? That's how htree dir indexing got in, and AFAIK most distros
use it as a default.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:10             ` Jeff Garzik
  2006-06-09 16:24               ` Erik Mouw
@ 2006-06-09 16:24               ` Chase Venters
  2006-06-09 16:25               ` Alex Tomas
  2 siblings, 0 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 16:24 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Jeff Garzik wrote:

> Alex Tomas wrote:
>> > > > > >  Linus Torvalds (LT) writes:
>>
>> 
>> LT>  Quite frankly, at this point, there's no way in hell I believe we can 
>> LT>  do major surgery on ext3. It's the main filesystem for a lot of users, 
>> LT>  and it's just not worth the instability worries unless it's something 
>> LT>  very obviously transparent.
>>
>>  I believe it's as stable as before until you mount with extents
>>  mount option.
>
> If it will remain a mount option, if it is never made the default (either in 
> kernel or distro level), then only 1% of users will ever use the feature. 
> And we shouldn't merge a 1% use feature into the _main_ filesystem for Linux.

Pardon me because I haven't made it all the way through this discussion 
yet, so I don't know if this has been suggested or dismissed. But I'm 
curious - rather than 'stealth upgrade' by way of mount options, why not 
just enable the functionality either via tune2fs or mkfs.ext3?

New distribution versions could ship installers that enable it, because users 
aren't really going to switch from a new distribution they just installed to 
an older version (same story on the kernel).

Users that want the functionality today can have it by asking for it with 
tune2fs, they just have to bypass the warning that tells them they're not 
going to be able to boot kernels before 2.6.xx

> 	Jeff

Cheers,
Chase

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:10             ` Jeff Garzik
  2006-06-09 16:24               ` Erik Mouw
  2006-06-09 16:24               ` [Ext2-devel] " Chase Venters
@ 2006-06-09 16:25               ` Alex Tomas
  2006-06-09 16:28                 ` Jeff Garzik
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:25 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> If it will remain a mount option, if it is never made the default
 JG> (either in kernel or distro level), then only 1% of users will ever
 JG> use the feature.  And we shouldn't merge a 1% use feature into the
 JG> _main_ filesystem for Linux.

strictly speaking, not that many users really need >2TB fs ...

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:10           ` [Ext2-devel] " Alex Tomas
  2006-06-09 16:10             ` Jeff Garzik
@ 2006-06-09 16:25             ` Linus Torvalds
  2006-06-09 16:48               ` Alex Tomas
                                 ` (3 more replies)
  1 sibling, 4 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:25 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Alex Tomas wrote:
> 
> I believe it's as stable as before until you mount with extents
> mount option.

That's always a possibility in theory, and almost never in practice.

Btw, I don't care about extents _per_se_. I do care about the fact that 
people seem to think that code gets better as it supports more features. 
Not so.

The whole logic of "code sharing is good" is a huge mistake. Shared code 
is not at all better than individual code snippets, and often much much 
worse. In particular, if the shared code has separate code-paths, it's not 
just twice as complicated: it's _more_ than twice as bad, since it introduces 
the conditionals _and_ it introduces the very real risk of the conditional 
being taken the wrong way by mistake.

In contrast, the last time two different filesystems introduced bugs in 
each other was approximately "never". They simply don't modify each other's 
code, they don't look at each other's data structures, and they don't jump 
into each other's routines.

So two separate filesystems are _less_ to maintain than one big one. Even 
if there's a lot of code that -could- be shared.

And no, extents in themselves aren't necessarily "the thing" that drives 
it from maintainable to unmaintainable. This crap grows over time. But I 
would _seriously_ suggest that starting anew with a "new" filesystem, and 
taking the time to actually also get _rid_ of some of the baggage would 
quite likely be a good idea.

Just as an example: ext3 _sucks_ in many ways. It has huge inodes that 
take up way too much space in memory. It has absolutely disgusting code to 
handle directory reading and writing (buffer heads! In 2006!). Its 
conditional indexing code is horrible. Its performance absolutely sucks 
when the journal is being drained or something.

Are you going to improve on any of those _fundamental_ problems? Or are 
you going to make them worse?

Hint: I'm betting you're not going to improve them by adding more 
features.

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:21               ` Mike Snitzer
@ 2006-06-09 16:27                 ` Jeff Garzik
  2006-06-09 16:48                   ` Alex Tomas
  2006-06-09 16:33                 ` Alex Tomas
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
	ext2-devel, linux-fsdevel

Mike Snitzer wrote:
> Developers never _want_ to branch (maintenance-hell), the question
> becomes: do the risks associated with ext3-with-extents' backward
> incompatibility _really_ justify the branch?


It's also a question of...  why keep adding modernizing features to 
ext3, thus keeping it on life support, but just barely?  If we are going 
to modernize the _main Linux filesystem_, let's not do it in a way that 
is slow, and ties our hands.

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:25               ` Alex Tomas
@ 2006-06-09 16:28                 ` Jeff Garzik
  2006-06-09 16:50                   ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:28 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> If it will remain a mount option, if it is never made the default
>  JG> (either in kernel or distro level), then only 1% of users will ever
>  JG> use the feature.  And we shouldn't merge a 1% use feature into the
>  JG> _main_ filesystem for Linux.
> 
> strictly speaking, not that many users really need >2TB fs ...

Not true.  Terabyte SATA drives are less than a year away.  2TB 
drives... probably 2 years?

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:24               ` Erik Mouw
@ 2006-06-09 16:28                 ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:28 UTC (permalink / raw)
  To: Erik Mouw
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Erik Mouw wrote:
> On Fri, Jun 09, 2006 at 12:10:59PM -0400, Jeff Garzik wrote:
>> Alex Tomas wrote:
>>> I believe it's as stable as before until you mount with extents
>>> mount option.
>> If it will remain a mount option, if it is never made the default 
>> (either in kernel or distro level), then only 1% of users will ever use 
>> the feature.  And we shouldn't merge a 1% use feature into the _main_ 
>> filesystem for Linux.
> 
> Why not? That's how htree dir indexing got in, and AFAIK most distros
> use it as a default.

The question is not today's usage, but long term production usage.  If 
it is destined to be default eventually, then it's not a 1% case.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:21               ` Mike Snitzer
  2006-06-09 16:27                 ` Jeff Garzik
@ 2006-06-09 16:33                 ` Alex Tomas
  2006-06-09 16:37                   ` [Ext2-devel] " Jeff Garzik
  2006-06-09 22:52                   ` Valdis.Kletnieks
  1 sibling, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Garzik, ext2-devel, linux-kernel, Christoph Hellwig,
	Mingming Cao, linux-fsdevel, Alex Tomas

>>>>> Mike Snitzer (MS) writes:

 MS> precludes their ability to boot older kernels (steps can be taken to
 MS> make them well aware of this). The "real world situation" you refer
 MS> to, while hypothetically valid, isn't something informed
 MS> ext3-with-extents users will _ever_ elect to do.

one who needs/wants to go back may get rid of extents by:
a) remounting w/o extents option
b) copying new-fashion-style files so that copies use blockmap
c) dropping extents feature in superblock

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:33                 ` Alex Tomas
@ 2006-06-09 16:37                   ` Jeff Garzik
  2006-06-09 22:52                   ` Valdis.Kletnieks
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:37 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Mike Snitzer, Christoph Hellwig, Mingming Cao, linux-kernel,
	ext2-devel, linux-fsdevel

Alex Tomas wrote:
>>>>>> Mike Snitzer (MS) writes:
> 
>  MS> precludes their ability to boot older kernels (steps can be taken to
>  MS> make them well aware of this). The "real world situation" you refer
>  MS> to, while hypothetically valid, isn't something informed
>  MS> ext3-with-extents users will _ever_ elect to do.
> 
> one who needs/wants to go back may get rid of extents by:
> a) remounting w/o extents option
> b) copying new-fashion-style files so that copies use blockmap
> c) dropping extents feature in superblock

More likely, they will just backup+restore rather than go through all that.

After leafing through a 50-page manual to match up kernel versions with 
ext3 features, to see which older kernels will (or won't) require all 
this work.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:25             ` Linus Torvalds
@ 2006-06-09 16:48               ` Alex Tomas
  2006-06-09 16:54                 ` KELEMEN Peter
  2006-06-09 16:55                 ` Jeff Garzik
  2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

so, instead of taking one (quite-well-tested) part that solves one of
the biggest ext3 limitations, you propose to start a new project and
get something in a year (probably) ?

I think about extents as a step-by-step way ...

thanks, Alex

>>>>> Linus Torvalds (LT) writes:

 LT> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that 
 LT> take up way too much space in memory. It has absolutely disgusting code to 
 LT> handle directory reading and writing (buffer heads! In 2006!). Its 
 LT> conditional indexing code is horrible. Its performance absolutely sucks 
 LT> when the journal is being drained or something.

 LT> Are you going to improve on any of those _fundamental_ problems? Or are 
 LT> you going to make them worse?

 LT> Hint: I'm betting you're not going to improve them by adding more 
 LT> features.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:27                 ` Jeff Garzik
@ 2006-06-09 16:48                   ` Alex Tomas
  2006-06-09 16:51                     ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:48 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Mike Snitzer, Alex Tomas, Christoph Hellwig, Mingming Cao,
	linux-kernel, ext2-devel, linux-fsdevel

>>>>> Jeff Garzik (JG) writes:

 JG> It's also a question of...  why keep adding modernizing features to
 JG> ext3, thus keeping it on life support, but just barely?  If we are
 JG> going to modernize the _main Linux filesystem_, let's not do it in a
 JG> way that is slow, and ties our hands.

I think trying to solve all problems at once will take much longer.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:28                 ` Jeff Garzik
@ 2006-06-09 16:50                   ` Alex Tomas
  2006-06-09 16:53                     ` [Ext2-devel] " Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 16:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> Alex Tomas wrote:
 >>>>>>> Jeff Garzik (JG) writes:
 JG> If it will remain a mount option, if it is never made the
 >> default
 JG> (either in kernel or distro level), then only 1% of users will ever
 JG> use the feature.  And we shouldn't merge a 1% use feature into the
 JG> _main_ filesystem for Linux.
 >> strictly speaking, not that many users really need >2TB fs ...

 JG> Not true.  Terabyte SATA drives are less than a year away.  2TB
 JG> drives... probably 2 years?

oh, 2 years sound long enough for defaulting extents?

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:48                   ` Alex Tomas
@ 2006-06-09 16:51                     ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:51 UTC (permalink / raw)
  To: Alex Tomas
  Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
	linux-fsdevel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> It's also a question of...  why keep adding modernizing features to
>  JG> ext3, thus keeping it on life support, but just barely?  If we are
>  JG> going to modernize the _main Linux filesystem_, let's not do it in a
>  JG> way that is slow, and ties our hands.
> 
> I think trying to solve all problems at once will take much longer.

I guess it's a good thing that real world development never works like that.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:50                   ` Alex Tomas
@ 2006-06-09 16:53                     ` Jeff Garzik
  2006-06-09 17:01                       ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:53 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Alex Tomas wrote:
>  >>>>>>> Jeff Garzik (JG) writes:
>  JG> If it will remain a mount option, if it is never made the
>  >> default
>  JG> (either in kernel or distro level), then only 1% of users will ever
>  JG> use the feature.  And we shouldn't merge a 1% use feature into the
>  JG> _main_ filesystem for Linux.
>  >> strictly speaking, not that many users really need >2TB fs ...
> 
>  JG> Not true.  Terabyte SATA drives are less than a year away.  2TB
>  JG> drives... probably 2 years?
> 
> oh, 2 years sound long enough for defaulting extents?

If terabyte drives will be here in less than a year, and 750GB drives 
are already here, then people with today's commodity hardware are 
probably already chomping at the bit to do >2TB LVM and RAID.

Hook eight 750GB SATA drives to a Marvell SATA controller (all 
commodity, all production) and you're way past 2TB.
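
To spell out that arithmetic, here is a minimal sketch in C. The helper
names are hypothetical, drive sizes are assumed to be decimal gigabytes
(10^9 bytes, as vendors count them), and the 2TB threshold is assumed to
mean 2 * 2^40 bytes; RAID parity and filesystem overhead are ignored.

```c
#include <stdint.h>

/* 2 TB expressed in binary units (2 * 2^40 bytes); an assumed reading
 * of the ">2TB" threshold used in the discussion. */
#define TWO_TIB (2ULL << 40)

/* Hypothetical helper: raw capacity in bytes of `ndrives` drives of
 * `gb_per_drive` decimal gigabytes each. */
static uint64_t array_bytes(unsigned ndrives, unsigned gb_per_drive)
{
        return (uint64_t)ndrives * gb_per_drive * 1000000000ULL;
}

/* Returns 1 if the combined raw capacity is larger than 2 TiB, else 0. */
static int exceeds_2tb(unsigned ndrives, unsigned gb_per_drive)
{
        return array_bytes(ndrives, gb_per_drive) > TWO_TIB;
}
```

With eight 750GB drives the raw total is 6 * 10^12 bytes, well above the
threshold; under these assumptions three such drives are already enough
to cross it.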

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:48               ` Alex Tomas
@ 2006-06-09 16:54                 ` KELEMEN Peter
  2006-06-09 16:55                 ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: KELEMEN Peter @ 2006-06-09 16:54 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

* Alex Tomas (alex@clusterfs.com) [20060609 20:48]:

> I think about extents as a step-by-step way ...

...so call it ext4 *now* and have a complete rewrite of the whole
codebase as ext5.  Users get what they want now (ext4) and Linus
gets what he wants later (ext5).  Extents are useful for Joe
Average User with <2 TB filesystems as well.

It's already funny enough that I'm using e2* tools for managing
ext3 filesystems...

Peter

-- 
    .+'''+.         .+'''+.         .+'''+.         .+'''+.         .+''
 Kelemen Péter     /       \       /       \     Peter.Kelemen@cern.ch
.+'         `+...+'         `+...+'         `+...+'         `+...+'



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:25             ` Linus Torvalds
  2006-06-09 16:48               ` Alex Tomas
@ 2006-06-09 16:54               ` Linus Torvalds
  2006-06-09 17:04                 ` Alex Tomas
                                   ` (2 more replies)
  2006-06-09 17:12               ` Jeff Anderson-Lee
  2006-06-09 18:02               ` Andrew Morton
  3 siblings, 3 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:54 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Linus Torvalds wrote:
> 
> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that 
> take up way too much space in memory.

Btw, I'm not kidding you on this one.

THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!

And you know what? 2TB files are totally uninteresting to 99.9999% of all 
people. Most people find it _much_ more interesting to have hundreds of 
thousands of _smaller_ files instead.

So do this:

	cat /proc/slabinfo | grep ext3

and be absolutely disgusted and horrified by the size of those inodes 
already, and ask yourself whether extending the block size to 48 bits will 
help or further hurt one of the biggest problems of ext3 right now?
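
For anyone who wants a number rather than raw slabinfo output, a minimal
sketch in C of the calculation follows. It assumes the 2.x slabinfo line
layout (name, active_objs, num_objs, objsize, ...); the function name is
hypothetical, and the result counts object bytes only, not slab
overhead.

```c
#include <stdio.h>
#include <string.h>

/* Parse one data line of /proc/slabinfo (assumed layout:
 * "name active_objs num_objs objsize ...") and return the bytes taken
 * by all objects of the named cache (num_objs * objsize).
 * Returns -1 if the line does not parse or is for a different cache. */
static long long slab_cache_bytes(const char *line, const char *cache)
{
        char name[64];
        unsigned long active, total, objsize;

        if (sscanf(line, "%63s %lu %lu %lu",
                   name, &active, &total, &objsize) != 4)
                return -1;
        if (strcmp(name, cache) != 0)
                return -1;
        return (long long)total * (long long)objsize;
}
```

Feeding each line of /proc/slabinfo to this function with the cache
name "ext3_inode_cache" would give the memory held by in-core ext3
inodes; header lines fail to parse and are skipped by the -1 return.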

(And yes, I realize that block numbers are just a small part of it. The 
"vfs_inode" is also a real problem - it's got _way_ too many large 
list-heads that explode on a 64-bit kernel, for example. Oh, well. My 
point is that things like this can make a very real issue _worse_ for all 
the people who don't care one whit about it)

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:48               ` Alex Tomas
  2006-06-09 16:54                 ` KELEMEN Peter
@ 2006-06-09 16:55                 ` Jeff Garzik
  2006-06-09 17:12                   ` [Ext2-devel] " Alex Tomas
                                     ` (2 more replies)
  1 sibling, 3 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:55 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
> so, instead of taking one (quite-well-tested) part that solves one of
> the biggest ext3 limitations, you propose to start a new project and
> get something in a year (probably) ?
> 
> I think about extents as a step-by-step way ...

That is what the entirety of Linux development is -- step-by-step.

It is OBVIOUS that it would take five minutes to start ext4.

1) clone a new tree
2) cp -a fs/ext3 fs/ext4
3) apply extent and 48bit patches
4) apply related e2fsprogs patches

Then update ext4 step-by-step, using the normal Linux development process.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40     ` Jeff Garzik
  2006-06-09 15:42       ` Matthew Wilcox
@ 2006-06-09 16:56       ` Andrew Morton
  2006-06-09 17:07         ` Jeff Garzik
  2006-06-09 18:23       ` Michael Poole
  2 siblings, 1 reply; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 16:56 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: hch, cmm, linux-kernel, ext2-devel, linux-fsdevel

On Fri, 09 Jun 2006 11:40:03 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> Users are now forced to remember that, if they write to their filesystem 
> after using either $mmver or $korgver kernels, they are locked out of 
> using older kernels.

The same happens if we create ext4 - earlier kernels don't support that,
either.

I suppose we could call it ext4, although that wouldn't make much
difference operationally.  The developers would probably choose to generate
ext4 from the same codebase as ext3 for maintainability reasons, rather
than choosing to copy-n-modify.  We'd need to see the patches to be able to
finally make that judgement.

> 
> And as features continue to be added in this manner, this problem gets 
> _exponentially_ worse.

"continue to be added"?  afaik this is the first time this has happened,
and there's no plan to do it again.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:17             ` [Ext2-devel] " Jeff Garzik
  2006-06-09 16:21               ` Mike Snitzer
@ 2006-06-09 16:56               ` Andreas Dilger
  2006-06-09 17:32                 ` [Ext2-devel] " Greg KH
  2006-06-09 18:48                 ` Jeff Garzik
  1 sibling, 2 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 16:56 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
	linux-fsdevel, Alex Tomas

On Jun 09, 2006  11:17 -0400, Jeff Garzik wrote:
> Not all users are big production houses that plan their filesystem 
> metadata migration months in advance!  I _guarantee_ some users will 
> boot into ext3-with-extents, use it for a while, and then try to 
> downgrade for whatever reason...  only to find they have been LOCKED 
> OUT.  That is a very real world situation, guys.

Except that the only way that they will get extents is if they read some
documentation that tells them to mount with "-o extents", which will also
say "this is incompatible with older kernels - only use it if you aren't
going to revert to older kernels".  If they try to mount such a filesystem
it will report "trying to mount filesystem with incompatible feature",
and "e2fsprogs" will report "incompatible feature extents - please upgrade
your e2fsprogs" (for versions newer than Nov 2004).
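
The refusal to mount described above comes down to a bitmask test on the
superblock's incompatible-feature field. A minimal sketch in C, under
stated assumptions: the macro and function names are hypothetical, and
the bit values are illustrative and should be checked against the real
ext3 headers and the patches.

```c
#include <stdint.h>

/* Illustrative incompat feature bits (assumed values, not authoritative). */
#define FEATURE_INCOMPAT_FILETYPE 0x0002u
#define FEATURE_INCOMPAT_RECOVER  0x0004u
#define FEATURE_INCOMPAT_EXTENTS  0x0040u

/* Features an older, pre-extents kernel is assumed to understand. */
#define OLD_KERNEL_SUPP \
        (FEATURE_INCOMPAT_FILETYPE | FEATURE_INCOMPAT_RECOVER)

/* Features a kernel with the extents patches is assumed to understand. */
#define NEW_KERNEL_SUPP (OLD_KERNEL_SUPP | FEATURE_INCOMPAT_EXTENTS)

/* Returns 1 if a kernel understanding `supported` may mount a filesystem
 * whose superblock advertises the incompat bits in `on_disk`; returns 0
 * if any advertised bit is unknown to that kernel. */
static int can_mount(uint32_t on_disk, uint32_t supported)
{
        return (on_disk & ~supported) == 0;
}
```

Once the extents bit is set on disk, the older kernel's check fails
cleanly at mount time instead of misreading extent-mapped files, which
is the behaviour the paragraph above relies on.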

It's a lot better than e.g. the latest ubuntu which (apparently,
I read) can't boot a kernel older than 2.6.15 because of udev (or
sysfs?) changes.  It's better than e.g. reiserfs vs. reiser4 compatibility
(which doesn't exist).  2.4 kernels probably can't mount a new udev root
filesystem because none of the /dev files exist either.  2.4 kernels can't
mount a filesystem that is using device mapper ("LVM 2.0") instead of
"LVM 1.0".  All 2.2 kernel.org kernels couldn't use any system with RAID,
because any distro worth its salt had upgraded the RAID code to a working
(incompatible) version.

Nobody is forcing users to use extents.   Same with large inodes in ext3,
which give a 7x speedup in samba4 performance - did this cause you any
heartburn yet?   Large inodes + fast EAs have been available for a couple
of years already for people who want to use them, will soon allow nanosecond
times, and maybe one day in the distant future will become the default, but
not yet.  In a few years, the support for extents in ext3 will be pervasive
and most people won't care if they can boot to 2.4.10 or not, and if they
care about this they will also know enough not to enable extents.  The ext3
developers are a very cautious bunch, and don't force anything onto users.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:53                     ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 17:01                       ` Alex Tomas
  2006-06-09 17:10                         ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 17:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

that's why we're trying to get it in *now*. because we need it.
and nobody AFAIK insists on making extents the default or such.

thanks, Alex

>>>>> Jeff Garzik (JG) writes:

 JG> If terabyte drives will be here in less than a year, and 750GB drives
 JG> are already here, then people with today's commodity hardware are
 JG> probably already chomping at the bit to do >2TB LVM and RAID.

 JG> Hook eight 750GB SATA drives to a Marvell SATA controller (all
 JG> commodity, all production) and you're way past 2TB.

 JG> 	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:04                 ` Alex Tomas
  2006-06-09 17:30                   ` [Ext2-devel] " Linus Torvalds
  2006-06-09 17:44                 ` Theodore Tso
  2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 17:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

oops :) I don't follow that well ... 

size of in-core inodes is a different problem.

thanks, Alex

>>>>> Linus Torvalds (LT) writes:

 LT> On Fri, 9 Jun 2006, Linus Torvalds wrote:
 >> 
 >> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that 
 >> take up way too much space in memory.

 LT> Btw, I'm not kidding you on this one.

 LT> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!

 LT> And you know what? 2TB files are totally uninteresting to 99.9999% of all 
 LT> people. Most people find it _much_ more interesting to have hundreds of 
 LT> thousands of _smaller_ files instead.

 LT> So do this:

 LT> 	cat /proc/slabinfo | grep ext3

 LT> and be absolutely disgusted and horrified by the size of those inodes 
 LT> already, and ask yourself whether extending the block size to 48 bits will 
 LT> help or further hurt one of the biggest problems of ext3 right now?

 LT> (And yes, I realize that block numbers are just a small part of it. The 
 LT> "vfs_inode" is also a real problem - it's got _way_ too many large 
 LT> list-heads that explode on a 64-bit kernel, for example. Oh, well. My 
 LT> point is that things like this can make a very real issue _worse_ for all 
 LT> the people who don't care one whit about it)

 LT> 		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:56       ` Andrew Morton
@ 2006-06-09 17:07         ` Jeff Garzik
  2006-06-09 17:35           ` Andrew Morton
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hch, cmm, linux-kernel, ext2-devel, linux-fsdevel

Andrew Morton wrote:
> On Fri, 09 Jun 2006 11:40:03 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
> 
>> Users are now forced to remember that, if they write to their filesystem 
>> after using either $mmver or $korgver kernels, they are locked out of 
>> using older kernels.
> 
> The same happens if we create ext4 - earlier kernels don't support that,
> either.
> 
> I suppose we could call it ext4, although that wouldn't make much
> difference operationally.  The developers would probably choose to generate
> ext4 from the same codebase as ext3 for maintainability reasons, rather
> than choosing to copy-n-modify.  We'd need to see the patches to be able to
> finally make that judgement.

I would propose the obvious...  'cp -a ext3 ext4', apply the extent and 
48bit patches, and then do the obvious search-n-replace.

I guarantee that developer momentum would take over from there.  Rather 
than fundamentally change ext3, let's let it stabilize.

>> And as features continue to be added in this manner, this problem gets 
>> _exponentially_ worse.
> 
> "continue to be added"?  afaik this is the first time this has happened,
> and there's no plan to do it again.

ext3 developers are _fundamentally changing_ the block allocation 
structure [in a good way].  If they can get away with it once, they will 
continue to modify ext3, adding btrees and other new gadgets.  That's 
just human nature.  For example, htree was a minor disaster, 
deployment-wise, on the distro vendor side.

I think extents and 48bit are so fundamental that it's silly to attempt 
to minimize the impact from the user's perspective, and moreover, I 
think Linux benefits more if ext3 is _not_ kept on life support this way.

We need to draw a line in the sand.  If we don't, no one ever will.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:01                       ` Alex Tomas
@ 2006-06-09 17:10                         ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:10 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
> that's why we're trying to get it in *now*. because we need it.
> and nobody AFAIK insists on making extents the default or such.

huh?  If it's needed, it will be default eventually.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:12                   ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 17:12                     ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:12 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> That is what the entirety of Linux development is -- step-by-step.
> 
>  JG> It is OBVIOUS that it would take five minutes to start ext4.
> 
> right. it's not a problem to *start*. it's a problem to maintain.
> day by day fs/ext3 and fs/ext4 will get more and more diffs.
> at some point it will be a headache to apply patches from ext3
> to ext4 and back. I know this very well ....

As Linus has stated, we have empirical evidence that splitting 
filesystems works, for both stability and development speed.

The number of patches to ext[23] will trickle off over time.  As the 
obvious example, ext4 would receive the extent and 48bit patches rather 
than ext3 :)

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:55                 ` Jeff Garzik
@ 2006-06-09 17:12                   ` Alex Tomas
  2006-06-09 17:12                     ` Jeff Garzik
  2006-06-09 19:57                   ` Theodore Tso
  2006-06-10  0:07                   ` Olivier Galibert
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 17:12 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> That is what the entirety of Linux development is -- step-by-step.

 JG> It is OBVIOUS that it would take five minutes to start ext4.

right. it's not a problem to *start*. it's a problem to maintain.
day by day fs/ext3 and fs/ext4 will get more and more diffs.
at some point it will be a headache to apply patches from ext3
to ext4 and back. I know this very well ....

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:25             ` Linus Torvalds
  2006-06-09 16:48               ` Alex Tomas
  2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:12               ` Jeff Anderson-Lee
  2006-06-09 18:02               ` Andrew Morton
  3 siblings, 0 replies; 296+ messages in thread
From: Jeff Anderson-Lee @ 2006-06-09 17:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: 'ext2-devel', linux-fsdevel

Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Alex Tomas wrote:
>  
> > I believe it's as stable as before until you mount with extents
> > mount option.
>    
> In contrast, the last time two different filesystems introduced bugs in 
> each other was approximately "never". They simply don't modify each others
> code, they don't look at each others data structures, and they don't jump 
> into each others routines.

As an interested bystander (and large filesystem user), I'd say I tend to 
agree with Linus and Jeff on this one.

* ext3 is arguably the main Linux filesystem: too important to keep 
  "experimenting" with.

* I'd encourage a >2TB version, but call it ext4.  It makes it clear
  that you are entering new territory.

* Take advantage of the switch to remove some of the backward compatibility
  cruft from the ext4 version -- make it a clean, explicit break.

* [Possibly even inoculate ext3 against creeping featuritis and work on 
  cleanup and optimization instead.]

This is not intended to slight the work/position of the ext3 developers,
merely to inform them of an end-user's perspective.

----
Jeff Anderson-Lee
Petabyte Storage Infrastructure Project
University of California Berkeley
"Simplify, simplify, simplify." -- Henry David Thoreau 
"I think one 'simplify' would have sufficed." -- Ralph Waldo Emerson 

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09  2:49 ` Jeff Garzik
  2006-06-09  8:35   ` Andreas Dilger
@ 2006-06-09 17:14   ` Alan Cox
  1 sibling, 0 replies; 296+ messages in thread
From: Alan Cox @ 2006-06-09 17:14 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: cmm, Andrew Morton, Linus Torvalds, linux-kernel, ext2-devel,
	linux-fsdevel

Ar Iau, 2006-06-08 am 22:49 -0400, ysgrifennodd Jeff Garzik:
> People (including me) still switch back and forth between ext2 and ext3 
> mounts of the same filesystem on occasion.  I think creating an "ext4" 
> would allow for greater developer flexibility in implementing new 
> features and ditching old ones -- while also emphasizing to the user 
> that switching back and forth between ext4 and ext[23] is a bad idea.

I would agree with this, particularly as ext3 and ext4 are quite small
in the kernel side of things and people needing 48bit extents are
probably not trying to run on 8MB of flash.

Alan


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:42       ` Matthew Wilcox
  2006-06-09 15:51         ` Jeff Garzik
@ 2006-06-09 17:29         ` Alan Cox
  1 sibling, 0 replies; 296+ messages in thread
From: Alan Cox @ 2006-06-09 17:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Christoph Hellwig, cmm, linux-fsdevel

Ar Gwe, 2006-06-09 am 09:42 -0600, ysgrifennodd Matthew Wilcox:
> Hang on, you're going too far.  You have to enable extents with the
> extent mount option.  Otherwise you don't get to use them.  The user
> does, in fact, have a clear division, although maybe the blinky signs
> aren't quite luminous enough.

<mba marketing>
I'd rather the blinky sign was "ext4". It makes it clear it is a
progression and it also gives everyone something to put in the features
box and talk to the press about 8)
</mba>

> I still think making ext3 bigger than 16TB is just silly.

We recently fixed a 'If the disk is 4TB in size the geometry reporting
breaks and parted crashes' bug. The stuff is out there and people want
to run ext3 on it or an ext3 derivative they feel they trust. Does it
matter whether it is the most optimal solution? That'll sort itself out
as ext3.5/ext4, reiser4, jfs, xfs etc. get picked and demanded by users.

Alan

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:04                 ` Alex Tomas
@ 2006-06-09 17:30                   ` Linus Torvalds
  2006-06-09 17:41                     ` Matthew Wilcox
  0 siblings, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 17:30 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Alex Tomas wrote:
> 
> oops :) I don't follow that well ... 
> 
> size of in-core inodes is a different problem.

Not really. It's really the same problem: adding features has a real cost.

And the cost is higher if you don't add them in a way that is statically 
separable.

So I'm not trying to make the in-core inode size be "the thing" to 
concentrate on. And I'm not saying that extents is inherently "the thing" 
that makes it sane to split up development. That time might have been a 
few years ago, or it might be in the future.

So don't get me wrong. I'm (a) generally supporting Jeff in that I think 
it makes sense to split projects off occasionally, and maybe even plan on 
hopefully having the original project deleted in the long run (it does 
actually happen, although it is fairly rare). And (b) trying to show the 
costs.

For me, the biggest cost tends to actually be support. A stable filesystem 
that is used by thousands and thousands of people and that isn't actually 
developed outside of just maintaining it IS A REALLY GOOD THING TO HAVE. 

And I'm not saying that just because it's a filesystem, and people get 
upset if they lose data. No, I'm saying it because from a maintenance 
standpoint, such a filesystem has almost zero cost.

So from a maintenance standpoint, it's actually a _lot_ more useful to me 
(and probably to a lot of other people) if development is done as its own 
project, and is merged as its own sub-project. When problems happen, it's 
fairly obvious what they are, and it's very much a case of all the people 
involved having made that choice ("Hey, you knew it wasn't as stable, but 
you wanted it for your special needs").

As an additional bonus, it tends to help find patterns in bug-reports 
("ahh, everyone involved is running ext4"). So not only does it not affect 
people who don't want to be affected, it also helps _pinpoint_ where 
problems are when they do happen.

Also, if it turns out that the stabilization thing worked well, and after 
a few years the _new_ code hasn't gotten any changes, and there are no 
other real downsides either, they can actually be merged later on too. 

That's what we're seeing in the 64-bit architecture support on both s390 
and powerpc (and maybe even x86, eventually? Possibly not, but who 
knows..). But that's a separate issue.

			Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:56               ` Andreas Dilger
@ 2006-06-09 17:32                 ` Greg KH
  2006-06-09 18:48                 ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Greg KH @ 2006-06-09 17:32 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Christoph Hellwig, linux-fsdevel,
	ext2-devel, Mingming Cao, linux-kernel

On Fri, Jun 09, 2006 at 10:56:43AM -0600, Andreas Dilger wrote:
> It's a lot better than e.g. the latest ubuntu which (apparently,
> I read) can't boot a kernel older than 2.6.15 because of udev (or
> sysfs?) changes.

If this is true, then it's only because the Ubuntu developers do not
want to support older kernel versions.  Other distros handle this just
fine (Gentoo and Debian for example).  This is not a kernel issue, but
rather a distro design issue.

Which is much different from the fact that I take a "ext3" partition
from my new distro and can't get to the data if I downgrade to an older
distro for whatever reason (or use an older rescue disk.)

Don't confuse distro design decisions with issues forced on an unknowing
user by the ext3 fs kernel developers.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:07         ` Jeff Garzik
@ 2006-06-09 17:35           ` Andrew Morton
  2006-06-09 17:48             ` Jeff Garzik
  2006-06-09 21:42             ` Sonny Rao
  0 siblings, 2 replies; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 17:35 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel

On Fri, 09 Jun 2006 13:07:37 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> I would propose the obvious...  'cp -a ext3 ext4', apply the extent and 
> 48bit patches, and then do the obvious search-n-replace.

Most of ext3 is JBD.  At least, in terms of complexity.  And I don't think
there's anything in this proposal which affects JBD, apart from changing
the blocksize.

Cloning JBD for this exercise would, I suspect, be the wrong thing to do -
the two clones would be pretty much identical, apart from some scalar
types.

I did suggest a couple of years ago that we should clone the ext3 part and
have both ext3 and ext4 use the same JBD layer - I don't know what happened
to that idea.

There has been steady, cautious but significant improvement happening in
ext3 over the past few years.  I'd expect that to continue, although
perhaps at a lower rate.  Having to apply the same changes to two
filesystems would be an obvious loss.

It comes down to looking at the patches, and I haven't done that in quite
some time.  Ideally the new functionality would all be under CONFIG_foo,
but I do not know if that is being proposed here?

> We need to draw a line in the sand.  If we don't, no one ever will.

You speak as if this is something which has happened before, or that it will
happen again.

All that being said, Linux's filesystems are looking increasingly crufty
and we are getting to the time where we would benefit from a greenfield
start-a-new-one.  That new one might even be based on reiser4 - has anyone
looked?  It's been sitting around for a couple of years.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:30                   ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:41                     ` Matthew Wilcox
  2006-06-09 17:50                       ` Jeff Garzik
                                         ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Matthew Wilcox @ 2006-06-09 17:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
> And I'm not saying that just because it's a filesystem, and people get 
> upset if they lose data. No, I'm saying it because from a maintenance 
> standpoint, such a filesystem has almost zero cost.

One of the costs (and I'm not disagreeing with your main point;
I think forking ext3 to ext4 at this point is reasonable), is that
bugfixes applied to one don't necessarily get applied to the other.
I found some recently between ext2 and ext3, and submitted those, but I
only audited one file.  There's lots more to look at and I just haven't
found the time recently.  Going to three variations is a lot more work
for auditing, and it might be worth splitting some bits which genuinely
are the same into common code.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
  2006-06-09 17:04                 ` Alex Tomas
@ 2006-06-09 17:44                 ` Theodore Tso
  2006-06-09 17:58                   ` Jeff Garzik
  2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
  2 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 17:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, Jun 09, 2006 at 09:54:49AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 9 Jun 2006, Linus Torvalds wrote:
> > 
> > Just as an example: ext3 _sucks_ in many ways. It has huge inodes that 
> > take up way too much space in memory.
> 
> Btw, I'm not kidding you on this one.
> 
> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!

To be fair, the bulk of the size of the inode is the
filesystem generic "struct inode", which is 480 bytes.  Ext3 just
includes the struct inode as part of its core data structure, which
makes the whole thing *look* big.  In fact, the ext3-specific part of
the in-core ext3 inode is only 188 bytes, for a total of 688 bytes for
the ext3_inode_info structure --- which is what you see in
/proc/slabinfo.   

Other filesystems store the struct inode via a pointer to a separately
allocated chunk of memory, which makes their in-core inode footprint
*look* smaller but that's just an illusion if the only place you look
is /proc/slabinfo.  For example, the xfs_inode_cache is 432 bytes, but
that's because struct inode is stored separately from xfs's
fake/pseudo "vfs inode" which it keeps around so the same code can be
used with Irix.  (It always amazes me that we allow this for XFS,
when everywhere else we insist that that kind of cross-OS or
cross-version portability code is a fundamental violation of
CodingStyle because it increases code bloat and makes the code harder
for Linux developers to read and maintain, but that's a rant for
another day.)
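
The difference between the two layouts described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `demo_*` structures are hypothetical stand-ins (the real `struct inode` and `ext3_inode_info` are far larger and have different fields), and `demo_container_of` is a local macro written for this example in the style of the usual offsetof-based helper.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic VFS inode; the real one is much bigger. */
struct demo_vfs_inode {
    unsigned long i_ino;
    char pad[64];
};

/* Embedded layout: one allocation, so the object size includes the generic part. */
struct demo_embedded_info {
    unsigned int fs_flags;
    struct demo_vfs_inode vfs_inode;
};

/* Pointer layout: the fs-specific object looks small because the
 * generic inode lives in a separate allocation. */
struct demo_split_info {
    unsigned int fs_flags;
    struct demo_vfs_inode *vfs_inode;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a generic inode back to the embedded fs-specific structure. */
struct demo_embedded_info *demo_embedded(struct demo_vfs_inode *inode)
{
    assert(inode != NULL);
    return demo_container_of(inode, struct demo_embedded_info, vfs_inode);
}

/* Bytes actually consumed per inode under each layout. */
size_t demo_embedded_footprint(void)
{
    return sizeof(struct demo_embedded_info);
}

size_t demo_split_footprint(void)
{
    return sizeof(struct demo_split_info) + sizeof(struct demo_vfs_inode);
}
```

With these stand-in types the split layout reports a smaller fs-specific object, yet its combined footprint is no smaller than the embedded one, which is the "illusion" being described.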

Now, obviously I won't say that we can't do work to trim down
ext3_inode_info.  To be fair, reiserfs has an inode which is 576 bytes
long, so they only have 96 bytes of filesystem-specific information,
instead of the 188 bytes that we have in ext3.  So we can look at
that, but remember that from the gross level, we're talking about 688
bytes per inode for ext3 compared to 576 bytes per inode for reiserfs
--- and at least 912 bytes per inode for xfs.
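
To put the per-inode figures above into cache-sized terms, here is a small arithmetic sketch. The byte counts are the ones quoted in this message; the inode count passed in is an assumption supplied by the caller, and the `demo_*` names are invented for the example. Note that 480 + 188 is 668, not 688; the sketch takes the quoted totals at face value and does not try to explain the 20-byte gap.

```c
#include <assert.h>
#include <stdint.h>

/* Per-inode sizes in bytes, as quoted in the message above. */
enum {
    DEMO_VFS_INODE    = 480,  /* generic struct inode */
    DEMO_EXT3_TOTAL   = 688,  /* ext3_inode_info as seen in slabinfo */
    DEMO_REISER_TOTAL = 576,  /* reiserfs in-core inode */
    DEMO_XFS_CACHE    = 432   /* xfs_inode_cache, excluding struct inode */
};

/* The reiserfs fs-specific part works out to the 96 bytes quoted above. */
static_assert(DEMO_REISER_TOTAL - DEMO_VFS_INODE == 96,
              "reiserfs fs-specific size should match the quoted figure");

/* XFS keeps struct inode separately, so both pieces have to be added. */
uint64_t demo_xfs_total(void)
{
    return (uint64_t)DEMO_XFS_CACHE + DEMO_VFS_INODE;
}

/* Memory needed to keep inode_count inodes of a given size in core. */
uint64_t demo_cache_bytes(uint64_t per_inode, uint64_t inode_count)
{
    return per_inode * inode_count;
}
```

For example, with a hypothetical 500,000 cached inodes, `demo_cache_bytes(DEMO_EXT3_TOTAL, 500000)` gives 344,000,000 bytes, against 288,000,000 for the reiserfs figure and 456,000,000 for the combined XFS figure.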

But I think you would agree that we would want to improve this number
"honestly", by trying to trim down actual memory structure use,
instead of just simply making the in-core data structure bushier so as
to hide the true size of the per-inode footprint from people looking
at /proc/slabinfo, right?  :-)

And in any case, this is why we have to think very carefully before
forking the codebase between ext3 and "ext4".  The work that we might
use to slim down ext4_inode_info would also have to be backported to
ext3_inode_info before ext3 users see the benefit.  And there may also
be bugs that now have to be fixed in _three_ separate codebases ---
ext2, ext3, and ext4.  To give another concrete example, adding
extents won't change the htree directory lookup code, so needlessly
having two copies of that htree code in the kernel would be a Bad
Idea(tm).  We've already on occasion found bugs that we had fixed in
ext3, but had forgotten to backport to ext2, and vice versa.  Adding a
third would triple our maintenance headache --- a similar reason why
we haven't started a 2.7 development tree yet, since we would have to
backport bug fixes back and forth between 2.6 and 2.7.

This is not to say that forking ext3 to make a copy of the code that we
call "ext4" is automatically a bad idea to be dismissed out of hand,
just as someday we might fork 2.6 and start a 2.7 development
branch.  But in both cases we need to think very hard about the
tradeoffs before we just go ahead and do it.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:35           ` Andrew Morton
@ 2006-06-09 17:48             ` Jeff Garzik
  2006-06-09 17:59               ` Jeff Garzik
  2006-06-09 21:42             ` Sonny Rao
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel

Andrew Morton wrote:
> On Fri, 09 Jun 2006 13:07:37 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
> 
>> I would propose the obvious...  'cp -a ext3 ext4', apply the extent and 
>> 48bit patches, and then do the obvious search-n-replace.
> 
> Most of ext3 is JBD.  At least, in terms of complexity.  And I don't think
> there's anything in this proposal which affects JBD, apart from changing
> the blocksize.
> 
> Cloning JBD for this exercise would, I suspect, be the wrong thing to do -
> the two clones would be pretty much identical, apart from some scalar
> types.
> 
> I did suggest a couple of years ago that we should clone the ext3 part and
> have both ext3 and ext4 use the same JBD layer - I don't know what happened
> to that idea.

The JBD API is reasonably distinct, so IMO this would be a logical next 
step.  I would hope they could use the same JBD, so, I strongly agree...


> There has been steady, cautious but significant improvement happening in
> ext3 over the past few years.  I'd expect that to continue, although
> perhaps at a lower rate.  Having to apply the same changes to two
> filesystems would be an obvious loss.

I disagree completely...  it would be an obvious win:  people who want 
stability get that, people who want new features get that too.


> It comes down to looking at the patches, and I haven't done that in quite
> some time.  Ideally the new functionality would all be under CONFIG_foo,
> but I do not know if that is being proposed here?
> 
>> We need to draw a line in the sand.  If we don't, no one ever will.
> 
> You speak as if this is something which has happened before, or that it will
> happen again.
> 
> All that being said, Linux's filesystems are looking increasingly crufty
> and we are getting to the time where we would benefit from a greenfield
> start-a-new-one.  That new one might even be based on reiser4 - has anyone
> looked?  It's been sitting around for a couple of years.

reiser4 actually has this same problem, but worse.  It has pluggable 
metadata even to the point of supporting plugin-style metadata development.

If we can successfully devolve a filesystem to metadata and algorithm 
plugins, that should be done at the VFS level, and not called "reiser4".

But in the absence of a different VFS API, I think it is the most 
practical of all the options to open the floodgates to ext4 rather than 
ext3.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:41                     ` Matthew Wilcox
@ 2006-06-09 17:50                       ` Jeff Garzik
  2006-06-09 18:00                         ` Alex Tomas
  2006-06-09 18:04                       ` [Ext2-devel] " Linus Torvalds
  2006-06-09 18:17                       ` Michael Poole
  2 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Matthew Wilcox wrote:
> On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
>> And I'm not saying that just because it's a filesystem, and people get 
>> upset if they lose data. No, I'm saying it because from a maintenance 
>> standpoint, such a filesystem has almost zero cost.
> 
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.
> I found some recently between ext2 and ext3, and submitted those, but I
> only audited one file.  There's lots more to look at and I just haven't
> found the time recently.  Going to three variations is a lot more work
> for auditing, and it might be worth splitting some bits which genuinely
> are the same into common code.

With extents and 48bit, you have multiple code paths to audit, regardless.

If applied to ext3, you have to audit

	fs/ext3/*.c:
		if (extents)
			...
		else
			...

as opposed to

	fs/ext3/*.c:
		...	non-extent code
	fs/ext4/*.c:
		...	extent code
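
A slightly more concrete sketch of the two shapes above, for illustration only. The `demo_*` types and functions are hypothetical and do not come from the posted patches; the extent mapper merely combines a 16-bit high part and a 32-bit low part into a 48-bit physical block number, in the spirit of the `ee_start_hi`/`ee_start` fields from the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-core inode holding just enough state for the sketch. */
struct demo_inode {
    bool     uses_extents;
    uint16_t start_hi;   /* high 16 bits of the physical start block */
    uint32_t start_lo;   /* low 32 bits of the physical start block */
};

/* Non-extent path: physical block numbers fit in 32 bits. */
uint64_t demo_map_indirect(const struct demo_inode *inode, uint32_t lblk)
{
    assert(inode != NULL);
    return (uint64_t)inode->start_lo + lblk;
}

/* Extent path: build the 48-bit start block, then add the offset. */
uint64_t demo_map_extent(const struct demo_inode *inode, uint32_t lblk)
{
    assert(inode != NULL);
    uint64_t start = ((uint64_t)inode->start_hi << 32) | inode->start_lo;
    return start + lblk;
}

/* Shape 1: a single filesystem that branches on a per-inode flag. */
uint64_t demo_map_branching(const struct demo_inode *inode, uint32_t lblk)
{
    if (inode->uses_extents)
        return demo_map_extent(inode, lblk);
    else
        return demo_map_indirect(inode, lblk);
}

/* Shape 2: two filesystems, each with its own operations table. */
struct demo_fs_ops {
    uint64_t (*map_block)(const struct demo_inode *inode, uint32_t lblk);
};

const struct demo_fs_ops demo_ext3_ops = { .map_block = demo_map_indirect };
const struct demo_fs_ops demo_ext4_ops = { .map_block = demo_map_extent };
```

Either shape leaves two mapping paths to audit; the difference is whether the choice is made by a branch inside one code base or by which operations table a filesystem registers.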

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:09           ` Linus Torvalds
@ 2006-06-09 17:58             ` Gerrit Huizenga
  2006-06-09 18:25               ` [Ext2-devel] " Chase Venters
                                 ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> > 
> > Jeff's approach taken to the ridiculous would mean that we'd have
> > ext versions 1-40 by now at least.  I don't think that helps much,
> > either.
> 
> On the other hand, I _guarantee_ you that it helps that we have ext2-3, 
> and not just ext2 (nobody even tried to keep ext1 compatible, thank the 
> Gods).

I had originally argued for ext4 as well based on the fact that it would
allow lots of potential cleanups & simplifications and at the same time
would allow a break in the on disk filesystems layout.

These changes don't yet change the actual on-disk layout and that might
be something that would be done if ext4 were a real, new filesystem.

But then how long until ext4 is used enough to be put into production?
How much testing will it *really* get in any form?  How long before
the people that are using 100 TB+ disk farms today (some of which are
chopping filesystems into 2-8 GB chunks, others with 2 TB filesystems
today) actually trust this new filesystem?  (Most vendors don't support
JFS today, XFS support isn't much better.)

We are seeing storage needs increasing at a frightening rate.  Health
Care folks want to store your MRI's, x-ray's, ultrasounds, etc. in high
res digital format across your entire life in near-line format.  Terabytes
over time per person.  Europe is already doing this pretty extensively,
the US is following suit.  Digital media creation has huge storage needs.
Most everything is moving to podcasts, webcasts, streaming audio & video.
Storage is huge, and ext3 is at the current breaking point.

I'd argue that whatever we call it, we need a standard, stable, supported
solution *soon* for large files, large filesystems, large storage systems
in Linux.

I'd think the quickest path is to relieve the pressure now in ext3.

We still haven't solved the filesystem check time problem, which is the
next big bugaboo.  But getting large filesystems to real customers soon,
e.g. in mainline, well tested, ready for distro support is my real goal.

> If for no other reason, than the fact that the ext3 development could be 
> much more aggressive early on. Exactly because it did NOT screw up the old 
> filesystem that everybody else depended on.

Yes, but we want aggressive with robustness for real users soon.  Lots
of crazy ext4 development could become technical wanking in no time, with
no point of stability, and no general usefulness in the short term.

> So we have empirical evidence that splitting filesystem work up does 
> actually help. 

Agreed.  But... Maybe that should be the set of changes *following*
extents.  Then the file format can change, several of the pending ideas
can be worked in, and some of the backwards compatibility can be cleaned
out if it is in the way.  Then the extents work can get us something
usable in all the interim distro releases for the real users who are
screaming now about the filesystem size limits.

gerrit

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:44                 ` Theodore Tso
@ 2006-06-09 17:58                   ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:58 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Theodore Tso wrote:
> And in any case, this is why we have to think very carefully before
> forking the codebase between ext3 and "ext4".  The work that we might
> use to slim down ext4_inode_info would also have to be backported to
> ext3_inode_info before ext3 users see the benefit.  And there may also

No, the entire point is that you stop backporting all the junk, and just 
leave ext3 as is.  Let it sit, let it stabilize.

New development -- including inode slimming work -- can be best done in 
ext4.  With ext3, you are fighting all those old back-compat features 
and associated code paths bloating up the in-core inode [code].

_Obviously_ there may be bugs found in three codebases, rather than two. 
  But over time those will trickle off, particularly when developers 
successfully resist the urge to continue modifying ext[23].

There will always be newer, bigger storage situations and arrays, and I 
think it's a mistake to continue modifying the same Linux filesystem to 
support all these situations.  The logical end result is a big, unwieldy 
codebase that supports $N metadata, data, and journal formats.

In the same way we don't stuff support for all PCI ethernet or SATA 
drivers into the same .o file, we shouldn't keep stuffing support for 
all these varying filesystem formats into ext3.o.  That creates (and 
extents exacerbate) the "what ext3 fs am I mounting, today?" support 
problem.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:48             ` Jeff Garzik
@ 2006-06-09 17:59               ` Jeff Garzik
  2006-06-09 18:27                 ` [Ext2-devel] " Mike Snitzer
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel

Jeff Garzik wrote:
> I disagree completely...  it would be an obvious win:  people who want 
> stability get that, people who want new features get that too.

And developers have a better outlet for their wacky developmental urges...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:50                       ` Jeff Garzik
@ 2006-06-09 18:00                         ` Alex Tomas
  0 siblings, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 18:00 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Wilcox, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

IMHO, 3 (three) if's for a whole fs don't look that bad.
on the other side, you'd need to audit much more of
almost the same lines ...

thanks, Alex

>>>>> Jeff Garzik (JG) writes:

 JG> With extents and 48bit, you have multiple code paths to audit, regardless.

 JG> If applied to ext3, you have to audit

 JG> 	fs/ext3/*.c:
 JG> 		if (extents)
 JG> 			...
 JG> 		else
 JG> 			...

 JG> as opposed to

 JG> 	fs/ext3/*.c:
 JG> 		...	non-extent code
 JG> 	fs/ext4/*.c:
 JG> 		...	extent code

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:25             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2006-06-09 17:12               ` Jeff Anderson-Lee
@ 2006-06-09 18:02               ` Andrew Morton
  3 siblings, 0 replies; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: alex, jeff, ext2-devel, linux-kernel, cmm, linux-fsdevel, adilger

On Fri, 9 Jun 2006 09:25:57 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> (buffer heads! In 2006!)

We should be able to make the vast majority of those go away, btw.

We already have `-o data=writeback,nobh'.  That gives us writeback-mode
with no buffer_heads on the pagecache.

On top of that we can implement nobh ordered-mode by adding an inode walk
which calls do_sync_file_range() into the appropriate place in commit.

The tricky part is the inode walk - at present super_block.s_list is a
list_head and it's not trivial to walk that without missing some inodes.

Probably it could be done via a new fs-private dirty-inode list which we
handle carefully, or via a walk of an i_ino-ordered radix-tree, which
doesn't miss things.

I floated this a year or so ago, but no little fishies bit.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:41                     ` Matthew Wilcox
  2006-06-09 17:50                       ` Jeff Garzik
@ 2006-06-09 18:04                       ` Linus Torvalds
  2006-06-09 18:17                       ` Michael Poole
  2 siblings, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
	cmm, linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Matthew Wilcox wrote:
> 
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.

I agree. However, that tends to be less of an issue if you fork off a 
stable base (which isn't always the case). Forking off something that is 
still being actively developed is a different matter entirely. I don't 
think ext3 is in that situation, really.

Also, one of the issues is when there are big VFS layer changes, which 
affect all filesystems. Then, a lot of people will think that it's easier 
to fix up one unified filesystem than it is to fix up five separate ones, 
and the fact is, that's often _not_ the case.

The unified filesystem potentially has so much crud and crap and other 
issues that it ends up being much more work to understand and fix it up 
than it would have been to do the same thing for five different 
filesystems that didn't play a lot of games and have complex

  "if this flag is set, do this code, otherwise do that code, and this 
   whole directory reading code btw has a static CONFIG_EXT3_INDEX thing, 
   so you won't even know if you caught all the interface changes when you 
   get a clean compile"

So I'm not a huge believer in "shared code is good code". I believe shared 
code is good only if it has no conditionals.

Ie the VFS-layer kind of code that acts the SAME for everybody is the good 
kind of sharing. The kind where you call into different routines that will 
do different things depending on a flag (which may not even be obvious to 
the caller) is usually the _bad_ kind of sharing, because that's the kind 
of code that ends up working for one user and not working for another, and 
trying to make it work for both may be fundamentally hard.

The

	if (sb->option.extent) 
		.. do one thing ..
	else
		.. do another ..

kind of thing is exactly what leads to problems later. Even if it allows 
sharing of 90% of the code (the caller of the function), it leads to 
problems exactly because of things that end up not quite working because 
people only tested one code-path, and it broke the other case in some 
really subtle way.
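
[Illustrative sketch, not part of the original message. It contrasts the
flag-testing style shown above with a VFS-like style where the caller is
flag-free and each format has its own routine. Every name here (toy_sb,
toy_map_ops, the 1000/2000 offsets) is hypothetical and is not taken from
ext3 or from the posted patches.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical superblock carrying a format flag. */
struct toy_sb { int has_extents; };

/* Style 1: one shared routine that branches on the flag at every call.
 * Both formats live in the same function, so a change made for one path
 * can silently break the other. */
static uint64_t map_block_shared(const struct toy_sb *sb, uint32_t lblk)
{
    if (sb->has_extents)
        return 1000u + (uint64_t)lblk;  /* stand-in for an extent lookup */
    else
        return 2000u + (uint64_t)lblk;  /* stand-in for an indirect-block lookup */
}

/* Style 2: each format has its own routine behind a table of function
 * pointers. The table is chosen once (think "at mount time"); callers
 * never test the flag again. */
struct toy_map_ops { uint64_t (*map_block)(uint32_t lblk); };

static uint64_t extent_map_block(uint32_t lblk)   { return 1000u + (uint64_t)lblk; }
static uint64_t indirect_map_block(uint32_t lblk) { return 2000u + (uint64_t)lblk; }

static const struct toy_map_ops extent_ops   = { extent_map_block };
static const struct toy_map_ops indirect_ops = { indirect_map_block };

/* The only place the flag is examined in style 2. */
static const struct toy_map_ops *toy_mount(const struct toy_sb *sb)
{
    assert(sb != 0);
    return sb->has_extents ? &extent_ops : &indirect_ops;
}
```

[In the second style a caller does toy_mount(sb)->map_block(lblk); the two
formats can then be tested and changed separately, which is one reading of
the "good kind of sharing" described above.]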

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:07                 ` Alex Tomas
  2006-06-09 16:09                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:04                   ` Matthew Frost
  2006-06-09 18:10                     ` Alex Tomas
  2006-06-09 18:14                     ` [Ext2-devel] " Andreas Dilger
  1 sibling, 2 replies; 296+ messages in thread
From: Matthew Frost @ 2006-06-09 18:04 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Linus Torvalds, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Think about how this will be deployed in production, long term.
> 
>  JG> If extents are not made default at some point, then no one will use
>  JG> the feature, and it should not be merged.
> 
> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> but we have it in the tree and any one may choose to use it.

NUMA is designed to cope with a hardware feature, which not everybody 
has.  Filesystem upgrades are not qualitatively similar; it does not 
depend on one's hardware design as to whether one uses ext3, let alone 
extents.  Your logic is faulty.

> the same
> with extents. let's have it in. but let's make clear it's experimental,
> it makes sense for large files only, it isn't backward compatible and
> so on.
> 
> thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
  2006-06-09 17:04                 ` Alex Tomas
  2006-06-09 17:44                 ` Theodore Tso
@ 2006-06-09 18:10                 ` Andreas Dilger
  2006-06-09 18:22                   ` Linus Torvalds
                                     ` (2 more replies)
  2 siblings, 3 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
	cmm, linux-fsdevel

On Jun 09, 2006  09:25 -0700, Linus Torvalds wrote:
> So two separate filesystems are _less_ to maintain than one big one. Even
> if there's a lot of code that -could- be shared.

That is true if people are willing to maintain both trees.  I think that
even with the current ext2/ext3 split there are continually fixes that are
missing from one filesystem or another.

> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> take up way too much space in memory. It has absolutely disgusting code to
> handle directory reading and writing (buffer heads! In 2006!).

My point exactly!  The ext2 directory code was moved from buffer heads to
page cache by Al after ext3 was forked and the code was never fixed in ext3.

I don't see this getting any better if there is an ext4 filesystem and all
of the ext3 developers are only interested in maintaining ext4.  Look at
reiserfs - it is completely abandoned by Hans in favour of reiser4 (the
entry in MAINTAINERS notwithstanding) except for Chris Mason at SuSE.

Having a single codebase for everyone means that it is continually maintained
and users of ext3 aren't left out in the cold.

On Jun 09, 2006  09:54 -0700, Linus Torvalds wrote:
> Btw, I'm not kidding you on this one.
> 
> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!

Do you think that would be any different with a new filesystem?

> And you know what? 2TB files are totally uninteresting to 99.9999% of all 
> people. Most people find it _much_ more interesting to have hundreds of 
> thousands of _smaller_ files instead.
> 
> So do this:
> 
> 	cat /proc/slabinfo | grep ext3

# head -2 /proc/slabinfo
slabinfo - version: 2.1
name       <active_objs> <num_objs> <objsize> <objperslab>

# grep ext2 /proc/slabinfo
ext2_inode_cache       0          0       572            7
ext2_xattr             0          0        48           81

# grep ext3 /proc/slabinfo

ext3_inode_cache   30207      41418       616            6
ext3_xattr             0          0        48           81

# grep xfs /proc/slabinfo
xfs_ili             2558       2576       140           28
xfs_inode           2558       2565       448            9

# grep jfs /proc/slabinfo
jfs_ip                 0          0      1048            3

So, the ext3 inode could grow another ~50 bytes without changing the
slab allocation size ;-), and in fact other filesystems aren't noticeably
different.
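
[Illustrative sketch, not part of the original message. It redoes the
headroom arithmetic with a deliberately naive model: one 4096-byte page per
slab, no slab management overhead, no alignment padding. Those are
assumptions, and the helper names are hypothetical. The naive model
reproduces the <objperslab> column for the inode caches above (616 -> 6,
572 -> 7, 448 -> 9, 1048 -> 3) but overestimates it for the small objects
(48, 140), which suggests real overhead that the model ignores.]

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLAB_BYTES 4096u  /* assumption: one 4 KiB page per slab */

/* Objects of `objsize` bytes that fit in one slab, ignoring overhead. */
static unsigned objs_per_slab(unsigned objsize)
{
    assert(objsize > 0);
    return TOY_SLAB_BYTES / objsize;
}

/* Largest object size that still packs `n` objects into one slab. */
static unsigned max_objsize_for(unsigned n)
{
    assert(n > 0);
    return TOY_SLAB_BYTES / n;
}
```

[With these assumptions max_objsize_for(6) is 682, so a 616-byte object has
at most 66 bytes of headroom before dropping to 5 per slab. That is an upper
bound; the "~50 bytes" quoted above is consistent with it once some per-slab
overhead is allowed for.]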

> and be absolutely disgusted and horrified by the size of those inodes 
> already, and ask yourself whether extending the block size to 48 bits will 
> help or further hurt one of the biggest problems of ext3 right now?

This is then the biggest problem of all filesystems.

> (And yes, I realize that block numbers are just a small part of it. The 
> "vfs_inode" is also a real problem - it's got _way_ too many large 
> list-heads that explode on a 64-bit kernel, for example. Oh, well.

On a 32-bit system the vfs_inode is more than half of the size of the ext3
inode, it is worse on 64-bit systems.

> My point is that things like this can make a very real issue _worse_ for all 
> the people who don't care one whit about it)

The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
and I think I argued fairly strongly to also have a CONFIG_ flag to allow
larger than 2TB file support only for those users that want it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:04                   ` Matthew Frost
@ 2006-06-09 18:10                     ` Alex Tomas
  2006-06-09 18:14                     ` [Ext2-devel] " Andreas Dilger
  1 sibling, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 18:10 UTC (permalink / raw)
  To: artusemrys
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Matthew Frost (MF) writes:

 MF> Alex Tomas wrote:
 >>>>>>> Jeff Garzik (JG) writes:
 JG> Think about how this will be deployed in production, long term.
 JG> If extents are not made default at some point, then no one will use
 JG> the feature, and it should not be merged.
 >> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
 >> but we have it in the tree and any one may choose to use it.

 MF> NUMA is designed to cope with a hardware feature, which not everybody
 MF> has.  Filesystem upgrades are not qualitatively similar; it does not
 MF> depend on one's hardware design as to whether one uses ext3, let alone
 MF> extents.  Your logic is faulty.

proposed 48bit extents patch addresses 2TB limit.


thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:04                   ` Matthew Frost
  2006-06-09 18:10                     ` Alex Tomas
@ 2006-06-09 18:14                     ` Andreas Dilger
  2006-06-09 18:51                       ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:14 UTC (permalink / raw)
  To: Matthew Frost
  Cc: Alex Tomas, Jeff Garzik, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Jun 09, 2006  13:04 -0500, Matthew Frost wrote:
> Alex Tomas wrote:
> >sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> >but we have it in the tree and any one may choose to use it.
> 
> NUMA is designed to cope with a hardware feature, which not everybody 
> has.  Filesystem upgrades are not qualitatively similar; it does not 
> depend on one's hardware design as to whether one uses ext3, let alone 
> extents.  Your logic is faulty.

If you have a > 8TB block device (which is common in large RAID devices
today, will be a single disk in a couple of years) then it is important
that your filesystem work with this block device.

If ext2 and ext3 didn't support > 2GB files (which was a filesystem
feature added in exactly the same way as extents are today, and nobody
bitched about it then) then they would be relegated to the same status
as minix and xiafs and all the other filesystems that are stuck in the
"we can't change" or "we aren't supported" camps.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:41                     ` Matthew Wilcox
  2006-06-09 17:50                       ` Jeff Garzik
  2006-06-09 18:04                       ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 18:17                       ` Michael Poole
  2 siblings, 0 replies; 296+ messages in thread
From: Michael Poole @ 2006-06-09 18:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger

Matthew Wilcox writes:

> On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
> > And I'm not saying that just because it's a filesystem, and people get 
> > upset if they lose data. No, I'm saying it because from a maintenance 
> > standpoint, such a filesystem has almost zero cost.
> 
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.
> I found some recently between ext2 and ext3, and submitted those, but I
> only audited one file.  There's lots more to look at and I just haven't
> found the time recently.  Going to three variations is a lot more work
> for auditing, and it might be worth splitting some bits which genuinely
> are the same into common code.

If you want more details on this kind of issue, look at CP-Miner.  A
paper published earlier this year in IEEE TSE[1] reports that that
tool found 421 cut-and-paste-related possible bugs in Linux, of which
49 were real bugs, 249 were false positives, and 123 could not be
proven either true or false positives.

[1]- http://doi.ieeecomputersociety.org/10.1109/TSE.2006.28

Michael Poole

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 18:22                   ` Linus Torvalds
  2006-06-09 18:30                     ` Alex Tomas
  2006-06-09 18:40                   ` [Ext2-devel] " Jeff Garzik
  2006-06-09 18:41                   ` Jeff Garzik
  2 siblings, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:22 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
	cmm, linux-fsdevel

On Fri, 9 Jun 2006, Andreas Dilger wrote:
> missing from one filesystem or another.
> 
> > Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> > take up way too much space in memory. It has absolutely disgusting code to
> > handle directory reading and writing (buffer heads! In 2006!).
> 
> My point exactly!  The ext2 directory code was moved from buffer heads to
> page cache by Al after ext3 was forked and the code was never fixed in ext3.

The code was never fixed in ext3, because ext3 is a pig in that area.

You misunderstand how this worked.

The reason ext2 got fixed was that ext2 was _simple_. It got fixed 
_despite_ the fact that it's not all that widely used any more, and not 
considered a really important filesystem. It got fixed because it wasn't 
too bad. It doesn't have all the crud that makes it a much more involved 
thing to do for ext3.

So if the ext2/3 split hadn't happened, _neither_ of them would be fixed.

See?

My point is, maintaining two different pieces is SIMPLER.

Even if that simplicity sometimes ends up meaning "not maintaining the 
other one".

So being out of sync is not a problem. It's a _feature_. 

> On Jun 09, 2006  09:54 -0700, Linus Torvalds wrote:
> > Btw, I'm not kidding you on this one.
> > 
> > THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
> 
> Do you think that would be any different with a new filesystem?

It would be bigger, if you made ext3 do 48-bit block numbers.

See? ext3 would become strictly _worse_ for the majority of users, who 
wouldn't get any advantage. That's my point.

> So, the ext3 inode could grow another ~50 bytes without changing the
> slab allocation size ;-), and in fact other filesystems aren't noticeably
> different.

Yes, I already pointed out that the biggest part of it was actually the 
vfs_inode thing.

And btw, growing more than 50 bytes is exactly what it would do. Go look.

> This is then the biggest problem of all filesystems.

Yeah, under many loads it is. We do really badly with lots of metadata in 
memory. Why do you think people have historically complained about things 
like the updatedb flushing their disk cache?

If you look at disk access patterns, one of _the_ biggest problems is not 
in reading individual files. It's in inode atime updates and the other 
"stupid crap" stuff.

> On a 32-bit system the vfs_inode is more than half of the size of the ext3
> inode, it is worse on 64-bit systems.

..which I pointed out, and doesn't change my point one _whit_. 

The fact that the block numbers aren't the _only_ problem doesn't suddenly 
mean they are problem-free, does it?

			Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:40     ` Jeff Garzik
  2006-06-09 15:42       ` Matthew Wilcox
  2006-06-09 16:56       ` Andrew Morton
@ 2006-06-09 18:23       ` Michael Poole
  2006-06-09 18:55         ` Jeff Garzik
  2006-06-10  0:49         ` Sven-Haegar Koch
  2 siblings, 2 replies; 296+ messages in thread
From: Michael Poole @ 2006-06-09 18:23 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
	linux-fsdevel

Jeff Garzik writes:

> Andrew Morton wrote:
> > Ted&co have been pretty good at avoiding compatibility problems.
> 
> Well, extents and 48bit make that track record demonstrably worse.
> 
> Users are now forced to remember that, if they write to their
> filesystem after using either $mmver or $korgver kernels, they are
> locked out of using older kernels.

Users are also forced to remember that, if they use certain new
distros or programs, they are locked out of using older kernels.  They
are forced to remember that if they have certain newer hardware, they
are locked out of using older kernels.  They are forced to remember
that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
using older kernels.  Why single out this particular aspect of limited
forward compatibility to harp on so much?

Michael Poole

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:58             ` Gerrit Huizenga
@ 2006-06-09 18:25               ` Chase Venters
  2006-06-10 13:46               ` Adrian Bunk
  2006-06-13 13:34               ` [Ext2-devel] " Helge Hafting
  2 siblings, 0 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 18:25 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Gerrit Huizenga wrote:

> We are seeing storage needs increasing at a frightening rate.  Health
> Care folks want to store your MRIs, x-rays, ultrasounds, etc. in high
> res digital format across your entire life in near-line format.  Terabytes
> over time per person.  Europe is already doing this pretty extensively,
> the US is following suit.  Digital media creation has huge storage needs.
> Most everything is moving to podcasts, webcasts, streaming audio & video.
> Storage is huge, and ext3 is at the current breaking point.
>
> I'd argue that whatever we call it, we need a standard, stable, supported
> solution *soon* for large files, large filesystems, large storage systems
> in Linux.
>
> I'd think the quickest path is to relieve the pressure now in ext3.

Makes sense...

>> So we have empirical evidence that splitting filesystem work up does
>> actually help.
>
> Agreed.  But... Maybe that should be the set of changes *following*
> extents.  Then the file format can change, several of the pending ideas
> can be worked in, and some of the backwards compatibility can be cleaned
> out if it is in the way.  Then the extents work can get us something
> usable in all the interim distro releases for the real users who are
> screaming now about the filesystem size limits.

Let's call ext3 "Linux 2.4" for a second and ext(x) w/extents and 48-bit 
"Linux 2.5". We can now do all the crazy, wild work we want on 2.5, but 
people need it tomorrow. And they can have it, but we're stamping 
"Dangerous! Dangerous! Unstable! API changes every 5 minutes, your data 
will be obsoleted each release!" all over it. This goes on for years until 
we finally reach a point where we can roll out "Linux 2.6".

The trouble is that "Linux 2.6" is something many of us are going to be 
wanting _now_.

Now, taking the quotes back off "Linux 2.6" and speaking about the kernel 
as a whole again - isn't lots of incremental stable releases with new 
functionality something that cutting off the development arm made 
possible?

I acknowledge the concerns about filesystem stability and Linus's points 
about improperly shared code. From a practical standpoint, I see the need 
of bigger filesystems coming.

And the biggest practical problem I see is one of perception. Making 
'ext4' means labelling it unstable for a while. And once something like a 
_filesystem_ is called unstable, it's going to be a long time before 
people trust it with terabytes of their incredibly valuable data (even if 
we promise them that it's mostly an ext3 fork).

Whereas if you play with some experimental 48-bit extension on ext3, well, 
ext3 already has a good reputation and is in use everywhere, so maybe this 
isn't a bad "last feature" to add before forking off into ext4-land?

> gerrit

Cheers,
Chase

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:59               ` Jeff Garzik
@ 2006-06-09 18:27                 ` Mike Snitzer
  2006-06-09 18:54                   ` Jeff Garzik
  2006-06-10 13:49                   ` Adrian Bunk
  0 siblings, 2 replies; 296+ messages in thread
From: Mike Snitzer @ 2006-06-09 18:27 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm, linux-kernel

On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> Jeff Garzik wrote:
> > I disagree completely...  it would be an obvious win:  people who want
> > stability get that, people who want new features get that too.
>
> And developers have a better outlet for their wacky developmental urges...

And no real-world near-term progress is made for production users with
modern requirements. What you're advocating breeds instability in the
near-term.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:44         ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
  2006-06-09 15:53           ` Alex Tomas
@ 2006-06-09 18:29           ` Andreas Dilger
  1 sibling, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas

On Jun 09, 2006  11:44 -0400, Jeff Garzik wrote:
> b) watch users boot w/ extents, accidentally do something silly like 
> writing data to a file, and become locked into a new subset of kernels?
> 
> The simple act of writing data to a file has become an _irrevocable 
> filesystem upgrade event_.

You keep on saying this, but you know it won't happen TODAY.  On the contrary,
if extents are merged today, I don't see distros making it a default mount
option for YEARS (it won't be the default for RHEL5, which is the only distro
that has participation among the ext3 developers, I can't comment for others).

WHEN extents become the default (which I hope they will at some point, like
dir_index and large inodes, that have been around for years already too)
then it will be mostly a non-issue (how many times do you boot into 2.2?).

The only exception is if you have a filesystem larger than 16TB you have
to use extents, which isn't an issue either way.  I don't think they will
ever become the default for e.g. root or boot filesystems, just for
compatibility reasons, but are highly desirable for e.g. mythtv or other
"large file" using filesystems.
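
[Illustrative sketch, not part of the original message. It works out the
size limits mentioned in this thread as plain powers of two: block-number
width plus log2(block size). The helper names are hypothetical. Reading the
8TB figure from the cover letter as 31 usable bits (a signed block number)
is an assumption.]

```c
#include <assert.h>
#include <stdint.h>

/* log2 of the largest addressable size in bytes:
 * 2^block_bits blocks, each 2^blocksize_log2 bytes. */
static unsigned fs_limit_log2(unsigned block_bits, unsigned blocksize_log2)
{
    return block_bits + blocksize_log2;
}

/* The same limit expressed in TiB (2^40 bytes). */
static uint64_t fs_limit_tib(unsigned block_bits, unsigned blocksize_log2)
{
    unsigned l = fs_limit_log2(block_bits, blocksize_log2);
    assert(l >= 40 && l - 40 < 64);  /* keep the shift well-defined */
    return (uint64_t)1 << (l - 40);
}
```

[With 4 KiB blocks (blocksize_log2 = 12): 32 bits gives 16 TiB, 31 bits
gives 8 TiB, and 48 bits gives 1048576 TiB, i.e. the 1024 PB quoted in the
cover letter. For comparison, 2^32 units of 512 bytes is 2 TiB.]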

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:22                   ` Linus Torvalds
@ 2006-06-09 18:30                     ` Alex Tomas
  2006-06-09 18:38                       ` Linus Torvalds
                                         ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Linus Torvalds (LT) writes:
 LT> My point is, maintaining two different pieces is SIMPLER.

"different" is a key word here. why should we copy most of ext3 code
into ext4?

 LT> It would be bigger, if you made ext3 do 48-bit block numbers.

nope, we re-use existing i_data w/o any changes. yes, we've made
inode a bit larger to cache last found extent. this noticeably
improves performance in some workloads, though.
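
[Illustrative sketch, not part of the original message and not the code
from the posted patches. It shows the general idea of a one-entry "last
found extent" cache kept in the in-core inode: a lookup that falls inside
the cached extent skips the slow walk. All names are hypothetical, and the
slow path is a stand-in that pretends the file is laid out in 8-block
extents starting at physical block 5000.]

```c
#include <assert.h>
#include <stdint.h>

struct toy_extent {
    uint32_t block;  /* first logical block covered */
    uint16_t len;    /* number of blocks covered; 0 means "cache empty" */
    uint64_t start;  /* first physical block */
};

struct toy_inode {
    struct toy_extent cached;  /* last extent found */
    unsigned slow_lookups;     /* counts slow-path calls, for demonstration */
};

/* Stand-in for walking the on-disk extent tree. */
static struct toy_extent toy_tree_lookup(struct toy_inode *ino, uint32_t lblk)
{
    uint32_t first = lblk & ~7u;
    struct toy_extent ex = { first, 8, 5000u + (uint64_t)first };
    ino->slow_lookups++;
    return ex;
}

/* Map a logical block to a physical block, consulting the cache first. */
static uint64_t toy_map_block(struct toy_inode *ino, uint32_t lblk)
{
    const struct toy_extent *c = &ino->cached;

    /* Miss: cache empty, or lblk outside [block, block + len). */
    if (c->len == 0 || lblk < c->block || lblk - c->block >= c->len)
        ino->cached = toy_tree_lookup(ino, lblk);

    return ino->cached.start + (lblk - ino->cached.block);
}
```

[Sequential access is the case this helps: consecutive blocks of one extent
cost a single slow lookup. The price is the extra struct in every in-core
inode, which is the size concern raised earlier in the thread.]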

 LT> See? ext3 would become strictly _worse_ for the majority of users, who 
 LT> wouldn't get any advantage. That's my point.

would "#if CONFIG_EXT3_EXTENTS" be a good solution then?

thanks. Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09  8:20   ` Andreas Dilger
@ 2006-06-09 18:35     ` Stephen C. Tweedie
  2006-06-09 19:20       ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 18:35 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Valdis.Kletnieks, linux-fsdevel, ext2-devel@lists.sourceforge.net,
	Mingming Cao, linux-kernel, Stephen Tweedie

Hi,

On Fri, 2006-06-09 at 02:20 -0600, Andreas Dilger wrote:

> > which implies matching changes to mkfs.ext2 and possibly mount..
> 
> The extents format doesn't need any support from mke2fs.  Currently this
> is activated by a mount option "-o extents", so it won't be used until
> a system administrator actively enables it.

It does need support from e2fsprogs, though; patches have been posted to
ext2-devel and are available on 

	http://www.bullopensource.org/ext4/index.html

though there is work left to do, especially to improve fsck's ability to
repair partially-damaged extent trees.

--Stephen



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:30                     ` Alex Tomas
@ 2006-06-09 18:38                       ` Linus Torvalds
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
                                           ` (2 more replies)
  2006-06-09 18:43                       ` [Ext2-devel] " Jeff Garzik
  2006-06-09 18:50                       ` Diego Calleja
  2 siblings, 3 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:38 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger



On Fri, 9 Jun 2006, Alex Tomas wrote:
> 
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?

Let's put it this way:
 - have you had _any_ valid argument at all against "ext4"?

Think about it. Honestly. Tell me anything that doesn't work?

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
  2006-06-09 18:22                   ` Linus Torvalds
@ 2006-06-09 18:40                   ` Jeff Garzik
  2006-06-09 18:59                     ` Andrew Morton
  2006-06-09 18:41                   ` Jeff Garzik
  2 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:40 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Linus Torvalds, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel

Andreas Dilger wrote:
> Having a single codebase for everyone means that it is continually maintained
> and users of ext3 aren't left out in the cold.

That implies continually upgrading ext3 for newer storage technologies, 
which in turn implies adding all sorts of incompatible formats to 
support better storage scaling, and new usage models.

This constant patching of ext3 is IMO one of the problems.  Let it 
stabilize with current storage technologies.


> On Jun 09, 2006  09:54 -0700, Linus Torvalds wrote:
>> Btw, I'm not kidding you on this one.
>>
>> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
> 
> Do you think that would be any different with a new filesystem?
> 
>> And you know what? 2TB files are totally uninteresting to 99.9999% of all 
>> people. Most people find it _much_ more interesting to have hundreds of 
>> thousands of _smaller_ files instead.
>>
>> So do this:
>>
>> 	cat /proc/slabinfo | grep ext3
> 
> # head -2 /proc/slabinfo
> slabinfo - version: 2.1
> name       <active_objs> <num_objs> <objsize> <objperslab>
> 
> # grep ext2 /proc/slabinfo
> ext2_inode_cache       0          0       572            7
> ext2_xattr             0          0        48           81
> 
> # grep ext3 /proc/slabinfo
> 
> ext3_inode_cache   30207      41418       616            6
> ext3_xattr             0          0        48           81
> 
> # grep xfs /proc/slabinfo
> xfs_ili             2558       2576       140           28
> xfs_inode           2558       2565       448            9
> 
> # grep jfs /proc/slabinfo
> jfs_ip                 0          0      1048            3
> 
> So, the ext3 inode could grow another ~50 bytes without changing the
> slab allocation size ;-), and in fact other filesystems aren't noticeably
> different.
> 
>> and be absolutely disgusted and horrified by the size of those inodes 
>> already, and ask yourself whether extending the block size to 48 bits will 
>> help or further hurt one of the biggest problems of ext3 right now?
> 
> This is then the biggest problem of all filesystems.
> 
>> (And yes, I realize that block numbers are just a small part of it. The 
>> "vfs_inode" is also a real problem - it's got _way_ too many large 
>> list-heads that explode on a 64-bit kernel, for example. Oh, well.
> 
> On a 32-bit system the vfs_inode is more than half of the size of the ext3
> inode, it is worse on 64-bit systems.
> 
>> My point is that things like this can make a very real issue _worse_ for all 
>> the people who don't care one whit about it)
> 
> The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
> and I think I argued fairly strongly to also have a CONFIG_ flag to allow
> larger than 2TB file support only for those users that want it.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
> 
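
The "~50 bytes" headroom quoted above can be checked roughly from the
slabinfo columns. This is a minimal sketch that assumes each slab is a
single 4096-byte page and ignores per-slab management overhead and
alignment, so it gives an upper bound rather than the exact figure. The
helper name slab_headroom() is hypothetical.

```c
#include <assert.h>

/* Upper bound on how many bytes an object can grow before fewer objects
 * fit in one slab of slab_bytes bytes. */
unsigned slab_headroom(unsigned slab_bytes, unsigned objsize,
                       unsigned objperslab)
{
    assert(objperslab > 0 && objsize * objperslab <= slab_bytes);
    return slab_bytes / objperslab - objsize;
}
```

For the numbers quoted above, slab_headroom(4096, 616, 6) gives 66 bytes
for ext3_inode_cache, against 7 bytes for xfs_inode (448 bytes, 9 per slab)
and 13 bytes for ext2_inode_cache (572 bytes, 7 per slab).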


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
  2006-06-09 18:22                   ` Linus Torvalds
  2006-06-09 18:40                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:41                   ` Jeff Garzik
  2 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:41 UTC (permalink / raw)
  To: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

Andreas Dilger wrote:
> The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
> and I think I argued fairly strongly to also have a CONFIG_ flag to allow
> larger than 2TB file support only for those users that want it.

Please be realistic.

Distros will all want to turn this on, from now until eternity.

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:30                     ` Alex Tomas
  2006-06-09 18:38                       ` Linus Torvalds
@ 2006-06-09 18:43                       ` Jeff Garzik
  2006-06-09 18:50                       ` Diego Calleja
  2 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:43 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Linus Torvalds, Andreas Dilger, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel

Alex Tomas wrote:
>>>>>> Linus Torvalds (LT) writes:
>  LT> My point is, maintaining two different pieces is SIMPLER.
> 
> "different" is a key word here. why should we copy most of ext3 code
> into ext4?
> 
>  LT> It would be bigger, if you made ext3 do 48-bit block numbers.
> 
> nope, we re-use existing i_data w/o any changes. yes, we've made
> inode a bit larger to cache last found extent. this improves
> performance in some workloads noticeably though.
> 
>  LT> See? ext3 would become strictly _worse_ for the majority of users, who 
>  LT> wouldn't get any advantage. That's my point.
> 
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?

No, that would be worse.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:56               ` Andreas Dilger
  2006-06-09 17:32                 ` [Ext2-devel] " Greg KH
@ 2006-06-09 18:48                 ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:48 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Christoph Hellwig, linux-fsdevel,
	ext2-devel, Mingming Cao, linux-kernel

Andreas Dilger wrote:
> Except that the only way that they will get extents is if they read some
> documentation that tells them to mount with "-o extents", which will also
> say "this is incompatible with older kernels - only use it if you aren't
> going to revert to older kernels".  If they try to mount such a filesystem
> it will report "trying to mount filesystem with incompatible feature",
> and "e2fsprogs" will report "incompatible feature extents - please upgrade
> your e2fsprogs" (for versions newer than Nov 2004).

False.  What will happen is that distros will default to extents, and 
users will continue to not read documentation, as usual.


> It's a lot better than e.g. the latest ubuntu which (apparently,
> I read) can't mount a kernel older than 2.6.15 because of udev (or
> sysfs?) changes.  It's better than e.g. reiserfs vs. reiser4 compatibility
> (which doesn't exist).  2.4 kernels probably can't mount a new udev root
> filesystem because none of the /dev files exist either.  2.4 kernels can't
> mount a filesystem that is using device mapper ("LVM 2.0") instead of
> "LVM 1.0".  All 2.2 kernel.org kernels couldn't use any system with RAID,
> because any distro worth its salt had upgraded the RAID code to a working
> (incompatible) version.

This is different.

The proposal is to change the thing called "ext3" to suddenly require 
kernels >= 2.6.18, while still calling it "ext3."

The above examples are actually proving my point.  The above examples 
had much more clear distinctions between incompatible upgrades.


> Nobody is forcing users to use extents.   Same with large inodes in ext3,
> which give a 7x speedup in samba4 performance - did this cause you any
> heartburn yet?   Large inodes + fast EAs are available for people who want
> to use it for a couple of years already, will soon allow nanosecond times
> and maybe one day in the distant future it will become the default but not
> yet.  In a few years, the support for extents in ext3 will be pervasive
> and most people won't care if they can boot to 2.4.10 or not, and if they
> care about this they will also know enough not to enable extents.  The ext3
> developers are a very cautious bunch, and don't force anything onto users.

I wouldn't use the word "cautious" to describe continually adding new, 
incompatible features to the main Linux filesystem.

You are as cautious as one can be, while adding potentially 
destabilizing features.

	Jeff
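
The "incompatible feature" refusal described in the quoted text rests on
feature bitmasks in the superblock. The following is a sketch of that
style of check, not the kernel's actual code: the function name
check_features(), the enum, and the macro names are hypothetical, and the
bit values (the extents bit in particular) are assumptions made for
illustration.

```c
#include <stdint.h>

enum mount_result { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSED };

/* Illustrative bit values only; treat them as assumptions. */
#define INCOMPAT_FILETYPE 0x0002u
#define INCOMPAT_RECOVER  0x0004u
#define INCOMPAT_EXTENTS  0x0040u

/* Decide how a kernel that understands the *_supp feature sets may mount
 * a filesystem whose superblock advertises incompat/ro_compat bits. */
enum mount_result check_features(uint32_t incompat, uint32_t ro_compat,
                                 uint32_t incompat_supp,
                                 uint32_t ro_compat_supp)
{
    if (incompat & ~incompat_supp)
        return MOUNT_REFUSED;   /* unknown incompat bit: refuse outright */
    if (ro_compat & ~ro_compat_supp)
        return MOUNT_RO_ONLY;   /* unknown ro_compat bit: read-only only */
    return MOUNT_RW;
}
```

Under these assumptions, an older kernel whose supported set lacks
INCOMPAT_EXTENTS returns MOUNT_REFUSED once that bit has been set on disk,
which is the lock-out being argued about in this thread.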

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:30                     ` Alex Tomas
  2006-06-09 18:38                       ` Linus Torvalds
  2006-06-09 18:43                       ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:50                       ` Diego Calleja
  2006-06-09 19:08                         ` Diego Calleja
  2 siblings, 1 reply; 296+ messages in thread
From: Diego Calleja @ 2006-06-09 18:50 UTC (permalink / raw)
  To: Alex Tomas
  Cc: torvalds, adilger, alex, jeff, akpm, ext2-devel, linux-kernel,
	cmm, linux-fsdevel

On Fri, 09 Jun 2006 22:30:20 +0400,
Alex Tomas <alex@clusterfs.com> wrote:


>  LT> See? ext3 would become strictly _worse_ for the majority of users, who 
>  LT> wouldn't get any advantage. That's my point.
> 
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?

Not at all, a config option may be disabled by lots of distros
and make backwards compatibility even more difficult than
it is already going to be.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:38                       ` Linus Torvalds
@ 2006-06-09 18:50                         ` Chase Venters
  2006-06-09 19:00                           ` Chase Venters
                                             ` (2 more replies)
  2006-06-09 19:22                         ` Alex Tomas
  2006-06-09 20:16                         ` Andreas Dilger
  2 siblings, 3 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 18:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Tomas, Andreas Dilger, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, 9 Jun 2006, Linus Torvalds wrote:

>
>
> On Fri, 9 Jun 2006, Alex Tomas wrote:
>>
>> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
>
> Let's put it this way:
> - have you had _any_ valid argument at all against "ext4"?
>
> Think about it. Honestly. Tell me anything that doesn't work?

It's about bundling. It's about being able to take your 3-year old 
dependable car and make it faster by bolting on new manifolds and 
turbochargers, rather than waiting a year for the manufacturer to release 
a totally new model (and buying totally new cars often means you're part 
of the manufacturer's debugging group, so be prepared to have things fail 
which require warranty work).

Now, granted, I really do agree with you about the whole code sharing 
thing. A fresh start is often just what you need. I'm just questioning if 
it wouldn't be better to do this fresh start immediately after going 
48-bit, rather than before. That way, existing users that want that extra 
oomph can have it today.

> 		Linus

Cheers,
Chase

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:14                     ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 18:51                       ` Jeff Garzik
  2006-06-09 19:39                         ` Gerrit Huizenga
                                           ` (3 more replies)
  0 siblings, 4 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:51 UTC (permalink / raw)
  To: Matthew Frost, Alex Tomas, Jeff Garzik, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

Andreas Dilger wrote:
> On Jun 09, 2006  13:04 -0500, Matthew Frost wrote:
>> Alex Tomas wrote:
>>> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
>>> but we have it in the tree and any one may choose to use it.
>> NUMA is designed to cope with a hardware feature, which not everybody 
>> has.  Filesystem upgrades are not qualitatively similar; it does not 
>> depend on one's hardware design as to whether one uses ext3, let alone 
>> extents.  Your logic is faulty.
> 
> If you have a > 8TB block device (which is common in large RAID devices
> today, will be a single disk in a couple of years) then it is important
> that your filesystem work with this block device.
> 
> If ext2 and ext3 didn't support > 2GB files (which was a filesystem
> feature added in exactly the same way as extents are today, and nobody
> bitched about it then) then they would be relegated to the same status
> as minix and xiafs and all the other filesystems that are stuck in the
> "we can't change" or "we aren't supported" camps.

PRECISELY.  So you should stop modifying a filesystem whose design is 
admittedly _not_ modern!

ext3 is already essentially xiafs-on-life-support, when you consider 
today's large storage systems and today's filesystem technology.  Just 
look at the ugly hacks needed to support expanding an ext3 filesystem 
online.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:27                 ` [Ext2-devel] " Mike Snitzer
@ 2006-06-09 18:54                   ` Jeff Garzik
  2006-06-09 19:22                     ` Alex Tomas
  2006-06-10 13:49                   ` Adrian Bunk
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm, linux-kernel

Mike Snitzer wrote:
> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
>> Jeff Garzik wrote:
>> > I disagree completely...  it would be an obvious win:  people who want
>> > stability get that, people who want new features get that too.
>>
>> And developers have a better outlet for their wacky developmental 
>> urges...
> 
> And no real-world near-term progress is made for production users with
> modern requirements. What you're advocating breeds instability in the
> near-term.

Constantly patching the main, "stable" Linux filesystem breeds 
instability today.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:23       ` Michael Poole
@ 2006-06-09 18:55         ` Jeff Garzik
  2006-06-09 19:42           ` [Ext2-devel] " Gerrit Huizenga
  2006-06-10  0:49         ` Sven-Haegar Koch
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:55 UTC (permalink / raw)
  To: Michael Poole
  Cc: Andrew Morton, ext2-devel, linux-kernel, Christoph Hellwig, cmm,
	linux-fsdevel

Michael Poole wrote:
> Jeff Garzik writes:
> 
>> Andrew Morton wrote:
>>> Ted&co have been pretty good at avoiding compatibility problems.
>> Well, extents and 48bit make that track record demonstrably worse.
>>
>> Users are now forced to remember that, if they write to their
>> filesystem after using either $mmver or $korgver kernels, they are
>> locked out of using older kernels.
> 
> Users are also forced to remember that, if they use certain new
> distros or programs, they are locked out of using older kernels.  They
> are forced to remember that if they have certain newer hardware, they
> are locked out of using older kernels.  They are forced to remember
> that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
> using older kernels.  Why single out this particular aspect of limited
> forward compatibility to harp on so much?

Because it's called backwards compat, when it isn't?
Because it is very difficult to find out which set of kernels you are 
locked out of?
Because the filesystem upgrade is stealthy, occurring as it does on the 
first data write?

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:40                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:59                     ` Andrew Morton
  2006-06-09 19:16                       ` Jeff Garzik
  2006-06-09 20:44                       ` Alan Cox
  0 siblings, 2 replies; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 18:59 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: ext2-devel, linux-kernel, torvalds, cmm, linux-fsdevel, alex,
	adilger

On Fri, 09 Jun 2006 14:40:56 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> Andreas Dilger wrote:
> > Having a single codebase for everyone means that it is continually maintained
> > and users of ext3 aren't left out in the cold.
> 
> That implies continually upgrading ext3 for newer storage technologies, 
> which in turn implies adding all sorts of incompatible formats to 
> support better storage scaling, and new usage models.

Look, I'm not certain either way on this - I really don't like the format
incompatibility and I'd like to see a breakdown of the performance benefits
of each of the proposed new features so perhaps we can cherrypick.  And I'm
deferring judgement until I've looked at some patches.

But Jeff, please stop this wild exaggeration!  "continually upgrading",
"all sorts of incompatible formats".  It's not helping anything.  

Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
ago, and probably with RH's 2.2-based implementation.  So we have not done
and will not do the things which you are FUDding us about.

This is (again, as far as I recall) the first on-disk-incompatible change
in ext3 which has ever been proposed.  It's not a thing which is done
lightly and it's not a thing which is likely to happen again for a very long
time indeed.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:00                           ` Chase Venters
  2006-06-10 13:33                             ` Adrian Bunk
  2006-06-09 19:01                           ` Jeff Garzik
  2006-06-09 19:21                           ` Alan Cox
  2 siblings, 1 reply; 296+ messages in thread
From: Chase Venters @ 2006-06-09 19:00 UTC (permalink / raw)
  To: Chase Venters
  Cc: Linus Torvalds, Alex Tomas, Andreas Dilger, Jeff Garzik,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, 9 Jun 2006, Chase Venters wrote:

> On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>> 
>>
>>  On Fri, 9 Jun 2006, Alex Tomas wrote:
>> > 
>> >  would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
>>
>>  Let's put it this way:
>>  - have you had _any_ valid argument at all against "ext4"?
>>
>>  Think about it. Honestly. Tell me anything that doesn't work?
>
> Now, granted, I really do agree with you about the whole code sharing thing. 
> A fresh start is often just what you need. I'm just questioning if it 
> wouldn't be better to do this fresh start immediately after going 48-bit, 
> rather than before. That way, existing users that want that extra oomph can 
> have it today.
>

Let me clarify that I don't have a final answer or opinion for whether or 
not 48-bit belongs in ext3 or ext4. But I'm trying to illustrate that it's an 
important question to raise.

In Group A we have some number of users that must have 48-bit support by 
Date B. 48-bit support could be available in ext3 by Date A, before Date 
B. It could also be available in ext4 by Date X, along with a handful of 
other features.

Is Date X before Date B? If it's not, is it worth telling Group A to 
suffer for a while, or asking them to use ext4 before it's ready? These 
are the questions I'd have to know the answers to if I were the one 
casting a final decision.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
  2006-06-09 19:00                           ` Chase Venters
@ 2006-06-09 19:01                           ` Jeff Garzik
  2006-06-10 19:27                             ` Kyle Moffett
  2006-06-09 19:21                           ` Alan Cox
  2 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:01 UTC (permalink / raw)
  To: Chase Venters
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Chase Venters wrote:
> Now, granted, I really do agree with you about the whole code sharing 
> thing. A fresh start is often just what you need. I'm just questioning 
> if it wouldn't be better to do this fresh start immediately after going 
> 48-bit, rather than before. That way, existing users that want that 
> extra oomph can have it today.

Then you continue to crap up the code with

	if (48bit)
		...
	else
		...

etc.

The proper way to do this is "cp -a ext3 ext4" (excluding JBD as Andrew 
mentioned), and then let evolution take its course.

"Evolution" means the standard Linux development -- patch the kernel, 
patch e4fsprogs, test, lather rinse repeat.  The best development 
platform for new features is one that _works_, and keeps working.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:50                       ` Diego Calleja
@ 2006-06-09 19:08                         ` Diego Calleja
  0 siblings, 0 replies; 296+ messages in thread
From: Diego Calleja @ 2006-06-09 19:08 UTC (permalink / raw)
  To: Diego Calleja
  Cc: akpm, jeff, ext2-devel, linux-kernel, torvalds, cmm,
	linux-fsdevel, alex, adilger

On Fri, 9 Jun 2006 20:50:00 +0200,
Diego Calleja <diegocg@gmail.com> wrote:

> Not at all, a config option may be disabled by lots of distros
> and make backwards compatibility even more difficult than
> it is already going to be.

(I meant: distros could switch it off, and two years from now
you might try to read data from a disk created by a kernel
with that feature enabled and find that it won't work, whereas
with the current approach you'll be able to use a mount flag.)

In my very humble user opinion, the big difference between ext2/3
and ext3/4 is that ext2/3 really was supposed to be on-disk
compatible, except for the journal. However ext4, AIUI, is
supposed to be _really_ different.

The kernel that includes the 48bit patches will already be a sort
of "ext4" filesystem, since 2.6.17 and previous kernels are not
going to be able to read it. Moving the source to ext4 or keeping
it in ext3/ is just about maintenance, not about making a new
filesystem or not - that will happen as soon as the patches are
merged.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:21                           ` Alan Cox
@ 2006-06-09 19:13                             ` Chase Venters
  2006-06-09 19:24                             ` Alex Tomas
  1 sibling, 0 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 19:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Chase Venters, Linus Torvalds, Alex Tomas, Andreas Dilger,
	Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel

On Fri, 9 Jun 2006, Alan Cox wrote:

> On Fri, 2006-06-09 at 13:50 -0500, Chase Venters wrote:
>> It's about bundling. It's about being able to take your 3-year old
>> dependable car and make it faster by bolting on new manifolds and
>> turbochargers, rather than waiting a year for the manufacturer to release
>> a totally new model
>
> Unfortunately in the software case if you want it in the base kernel you
> are bolting new manifolds on everyone's car at once, and someone is going
> to have an engine explode as a result.

Someone _could_ have an engine explode... it's perfectly possible though 
that a well-tested 48-bit patch wouldn't cause anyone's ext3 to explode. 
(After all, the vehicle analogy breaks down here - software doesn't get 
worn out from being run at redline for too many miles.)

> Ext3 already has enough back compatibility that you can replace the
> engine with a horse, we don't need any more in it thank you.

But just what are the costs of calling it quits now? Are we going to deny 
users something they need?

>
> Alan
>

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:59                     ` Andrew Morton
@ 2006-06-09 19:16                       ` Jeff Garzik
  2006-06-09 20:27                         ` [Ext2-devel] " Chase Venters
  2006-06-09 20:44                       ` Alan Cox
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ext2-devel, linux-kernel, torvalds, cmm, linux-fsdevel, alex,
	adilger

Andrew Morton wrote:
> On Fri, 09 Jun 2006 14:40:56 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
> 
>> Andreas Dilger wrote:
>>> Having a single codebase for everyone means that it is continually maintained
>>> and users of ext3 aren't left out in the cold.
>> That implies continually upgrading ext3 for newer storage technologies, 
>> which in turn implies adding all sorts of incompatible formats to 
>> support better storage scaling, and new usage models.
> 
> Look, I'm not certain either way on this - I really don't like the format
> incompatibility and I'd like to see a breakdown of the performance benefits
> of each of the proposed new features so perhaps we can cherrypick.  And I'm
> deferring judgement until I've looked at some patches.
> 
> But Jeff, please stop this wild exaggeration!  "continually upgrading",
> "all sorts of incompatible formats".  It's not helping anything.  
> 
> Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
> ago, and probably with RH's 2.2-based implementation.  So we have not done
> and will not do the things which you are FUDding us about.
> 
> This is (again, as far as I recall) the first on-disk-incompatible change
> in ext3 which has ever been proposed.  It's not a thing which is done
> lightly and it's not a thing which is likely to happen again for a very long
> time indeed.

That's not really true, I include in the list EXT3_FEATURE_RO_COMPAT_*, 
EXT3_FEATURE_INCOMPAT_*, 32-bit uid/gid, ISTR some ACL-related mess, and 
the online resizing stuff that produces a filesystem slightly different 
than what mke2fs would produce for the same [larger] sized block device. 
Red Hat has had at least one problem in the past where users were 
annoyed at format changes (htree?).

I certainly grant that extents and 48bit are format changes on a -much- 
larger scale than in the past.  Absolutely.

That's why I feel that this is a good point to calm down ext3 
development, and start putting stuff like extents into ext4.  If we are 
starting to make major changes to the format, that should be a signal 
that we are starting to work on a new filesystem, rather than patching 
an old one.

I disagree with the "years to stabilize ext4" argument, because we are 
starting from a known good point.  I think ext4 will be easier to 
maintain and tune for modern storage systems, if we don't have to worry 
as much about that stuff for ext3.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:35     ` [Ext2-devel] " Stephen C. Tweedie
@ 2006-06-09 19:20       ` Jeff Garzik
  2006-06-09 19:28         ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:20 UTC (permalink / raw)
  To: linux-fsdevel, ext2-devel@lists.sourceforge.net, linux-kernel

Stephen C. Tweedie wrote:
> 	http://www.bullopensource.org/ext4/index.html


heh, some ext3 developers are even calling it ext4 already ;-)

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
  2006-06-09 19:00                           ` Chase Venters
  2006-06-09 19:01                           ` Jeff Garzik
@ 2006-06-09 19:21                           ` Alan Cox
  2006-06-09 19:13                             ` [Ext2-devel] " Chase Venters
  2006-06-09 19:24                             ` Alex Tomas
  2 siblings, 2 replies; 296+ messages in thread
From: Alan Cox @ 2006-06-09 19:21 UTC (permalink / raw)
  To: Chase Venters
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, 2006-06-09 at 13:50 -0500, Chase Venters wrote:
> It's about bundling. It's about being able to take your 3-year old 
> dependable car and make it faster by bolting on new manifolds and 
> turbochargers, rather than waiting a year for the manufacturer to release 
> a totally new model

Unfortunately in the software case if you want it in the base kernel you
are bolting new manifolds on everyone's car at once, and someone is going
to have an engine explode as a result.

Ext3 already has enough back compatibility that you can replace the
engine with a horse, we don't need any more in it thank you.

Alan

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:38                       ` Linus Torvalds
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:22                         ` Alex Tomas
  2006-06-09 19:22                           ` Jeff Garzik
  2006-06-09 20:16                         ` Andreas Dilger
  2 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

IMHO ...

the main reason is that ext4 would be treated as a new-generation
fs which will probably be used for lots of new features, and it
will take a long time to get into a production-ready state. at the
same time, the proposed patches (at least extents itself) are heavily
tested in production and could be made available for our users
very soon.

thanks, Alex

>>>>> Linus Torvalds (LT) writes:

 LT> On Fri, 9 Jun 2006, Alex Tomas wrote:
 >> 
 >> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?

 LT> Let's put it this way:
 LT>  - have you had _any_ valid argument at all against "ext4"?

 LT> Think about it. Honestly. Tell me anything that doesn't work?

 LT> 		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:54                   ` Jeff Garzik
@ 2006-06-09 19:22                     ` Alex Tomas
  2006-06-09 19:23                       ` Jeff Garzik
  2006-06-09 22:49                       ` Valdis.Kletnieks
  0 siblings, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:22 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel


what if the proposed patch is safer than an average fix?
(given that it's simply unused unless enabled)

thanks, Alex

>>>>> Jeff Garzik (JG) writes:

 JG> Mike Snitzer wrote:
 >> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
 >>> Jeff Garzik wrote:
 >>> > I disagree completely...  it would be an obvious win:  people who want
 >>> > stability get that, people who want new features get that too.
 >>> 
 >>> And developers have a better outlet for their wacky developmental 
 >>> urges...
 >> 
 >> And no real-world near-term progress is made for production users with
 >> modern requirements. What you're advocating breeds instability in the
 >> near-term.

 JG> Constantly patching the main, "stable" Linux filesystem breeds 
 JG> instability today.

 JG> 	Jeff


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:22                         ` Alex Tomas
@ 2006-06-09 19:22                           ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:22 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
> the main reason is that ext4 would be treated as a new-generation
> fs which will probably be used for lots of new features. and it
> will take a long time to get into a production-ready state. at the
> same time, the proposed patches (at least extents itself) are heavily
> tested in production and could be made available to our users
> very soon.

No -- that's a bad way to develop it, and a good way to ensure it will 
never get stable.

You want to start from a known good point, and keep it working. 
Standard iterative development model.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:22                     ` Alex Tomas
@ 2006-06-09 19:23                       ` Jeff Garzik
  2006-06-09 22:49                       ` Valdis.Kletnieks
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:23 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel

Alex Tomas wrote:
> what if the proposed patch is safer than an average fix?
> (given that it's simply unused unless enabled)

Regardless of how you phrase it, it is an inescapable fact that you are 
developing new stuff in the main, supposedly-stable Linux filesystem.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:21                           ` Alan Cox
  2006-06-09 19:13                             ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:24                             ` Alex Tomas
  2006-06-09 19:25                               ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:24 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

>>>>> Alan Cox (AC) writes:

 AC> Unfortunately in the software case if you want it in the base kernel you
 AC> are bolting new manifolds on everyone's car at once, and someone is going
 AC> to have an engine explode as a result.

please, don't forget you need to enable it by mount option.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:24                             ` Alex Tomas
@ 2006-06-09 19:25                               ` Jeff Garzik
  2006-06-09 19:35                                 ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:25 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger, Alan Cox

Alex Tomas wrote:
>>>>>> Alan Cox (AC) writes:
> 
>  AC> Unfortunately in the software case if you want it in the base kernel you
>  AC> are bolting new manifolds on everyone's car at once, and someone is going
>  AC> to have an engine explode as a result.
> 
> please, don't forget you need to enable it by mount option.

Irrelevant.  That's a development-only situation.  It will be enabled by 
default eventually, and should be considered in that light.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:20       ` Jeff Garzik
@ 2006-06-09 19:28         ` Alex Tomas
  2006-06-09 19:32           ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:28 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-fsdevel, ext2-devel@lists.sourceforge.net, linux-kernel

>>>>> Jeff Garzik (JG) writes:

 JG> Stephen C. Tweedie wrote:
 >> http://www.bullopensource.org/ext4/index.html


 JG> heh, some ext3 developers are even calling it ext4 already ;-)

I bet you once proposed this name a few years ago ;)


thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:28         ` Alex Tomas
@ 2006-06-09 19:32           ` Jeff Garzik
  2006-06-09 19:41             ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:32 UTC (permalink / raw)
  To: Alex Tomas; +Cc: linux-fsdevel, ext2-devel@lists.sourceforge.net, linux-kernel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Stephen C. Tweedie wrote:
>  >> http://www.bullopensource.org/ext4/index.html
> 
> 
>  JG> heh, some ext3 developers are even calling it ext4 already ;-)
> 
> I bet you once proposed this name a few years ago ;)

I wouldn't consider myself an ext3 developer :)  ext2meta (online 
defrag) is my only real contribution.

I am much too much of an NIH guy, but I would be willing to participate 
in ext4 development.  Everybody here has no doubt experimented with 
their own from-scratch filesystem, and I am no different:
http://www.kernel.org/pub/linux/kernel/people/jgarzik/ibu/

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:35                                 ` Alex Tomas
@ 2006-06-09 19:35                                   ` Jeff Garzik
  2006-06-09 20:44                                   ` Joel Becker
  2006-06-11 20:14                                   ` [Ext2-devel] " grundig
  2 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:35 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Alan Cox, Chase Venters, Linus Torvalds, Andreas Dilger,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Irrelevant.  That's a development-only situation.  It will be enabled
>  JG> by default eventually, and should be considered in that light.
> 
> that's your point of view. mine is that this option (and code)
> is to be used only when needed.

Regardless of any use "when needed," the code is in the codebase, and is 
thus the "if (metadata_v2) ... else ..." maintenance burden that has 
been discussed.

	Jeff
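
[Editor's sketch, not part of the original message.] The "if (metadata_v2)
... else ..." burden mentioned above can be pictured roughly as follows. This
is a minimal illustration under stated assumptions, not the ext3 code: every
name here (sketch_inode, SKETCH_INODE_EXTENTS, sketch_map_block) is
hypothetical, and the old format is reduced to twelve direct block slots with
no indirect blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode flag: set when the block map is stored as an extent. */
#define SKETCH_INODE_EXTENTS 0x1u

/* Hypothetical in-memory extent: one contiguous run of blocks. */
struct sketch_extent {
    uint32_t first_logical;   /* first logical block covered */
    uint16_t len;             /* number of blocks covered */
    uint64_t first_physical;  /* physical block backing first_logical */
};

/* Hypothetical inode carrying both formats side by side. */
struct sketch_inode {
    uint32_t flags;
    uint32_t direct[12];          /* old format: one slot per logical block */
    struct sketch_extent extent;  /* new format: a single extent */
};

/* Map a logical block to a physical block; 0 means "not mapped". */
uint64_t sketch_map_block(const struct sketch_inode *inode, uint32_t logical)
{
    assert(inode != NULL);

    if (inode->flags & SKETCH_INODE_EXTENTS) {
        /* New path: offset into the contiguous run. */
        const struct sketch_extent *ex = &inode->extent;
        if (logical >= ex->first_logical &&
            logical - ex->first_logical < ex->len)
            return ex->first_physical + (logical - ex->first_logical);
        return 0;
    }

    /* Old path: direct lookup, kept alive alongside the new one. */
    if (logical < 12)
        return inode->direct[logical];
    return 0;
}
```

Both branches have to be read, tested and fixed for as long as both formats
are supported, whether or not a given user ever enables the new one.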

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:25                               ` Jeff Garzik
@ 2006-06-09 19:35                                 ` Alex Tomas
  2006-06-09 19:35                                   ` [Ext2-devel] " Jeff Garzik
                                                     ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:35 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger,
	Alan Cox

>>>>> Jeff Garzik (JG) writes:

 JG> Irrelevant.  That's a development-only situation.  It will be enabled
 JG> by default eventually, and should be considered in that light.

that's your point of view. mine is that this option (and code)
is to be used only when needed.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:51                       ` Jeff Garzik
@ 2006-06-09 19:39                         ` Gerrit Huizenga
  2006-06-09 19:45                           ` [Ext2-devel] " Jeff Garzik
  2006-06-10 10:03                           ` Christoph Hellwig
  2006-06-09 19:49                         ` [Ext2-devel] " Theodore Tso
                                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 296+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 19:39 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas


On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
> 
> PRECISELY.  So you should stop modifying a filesystem whose design is 
> admittedly _not_ modern!

So just how long do you think it would take to get a modern filesystem
into the hands of real users, supported by the distros?  From community
building, through design, development, testing, delivery?

gerrit

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:32           ` Jeff Garzik
@ 2006-06-09 19:41             ` Alex Tomas
  0 siblings, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 19:41 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: linux-fsdevel, ext2-devel@lists.sourceforge.net, Alex Tomas,
	linux-kernel

>>>>> Jeff Garzik (JG) writes:

 JG> I am much too much of an NIH guy, but I would be willing to
 JG> participate in ext4 development.  Everybody here has no doubt
 JG> experimented with their own from-scratch filesystem, and I am no
 JG> different:
 JG> http://www.kernel.org/pub/linux/kernel/people/jgarzik/ibu/

sigh, this is exactly what I was talking about ...

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:55         ` Jeff Garzik
@ 2006-06-09 19:42           ` Gerrit Huizenga
  2006-06-09 20:00             ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 19:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Michael Poole, Andrew Morton, ext2-devel, linux-kernel,
	Christoph Hellwig, cmm, linux-fsdevel

On Fri, 09 Jun 2006 14:55:56 EDT, Jeff Garzik wrote:
> 
> Because it's called backwards compat, when it isn't?
> Because it is very difficult to find out which set of kernels you are 
> locked out of?
> Because the filesystem upgrade is stealthy, occurring as it does on the 
> first data write?

Actually, the *only* point being contended here is running older
kernels on some newer filesystems (created originally with a newer
kernel), right?

Or do you have examples of where current kernels could not deal
with an ext3 feature at some point in time?

I would argue that 0.001% of all Linux *users* actually worry about
this - most of them are right here on the development mailing list.
So, that group is more vocal, for sure.  But, if it works for 99.99+%
users, aren't we still on the good path, from the point of view of
those people who actually *use* Linux the most?

gerrit
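
[Editor's sketch, not part of the original message.] The "older kernel, newer
filesystem" case discussed above is governed by the feature sets recorded in
the ext2/ext3 superblock: compatible, read-only-compatible and incompatible.
A kernel that does not recognise an incompatible feature bit refuses the
mount, and one that does not recognise a read-only-compatible bit may mount
read-only at most. The code below is only an illustration of that rule; the
struct, the enum, the function and the bit value are hypothetical names, not
the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit standing in for an "extents" incompatible feature. */
#define SKETCH_INCOMPAT_EXTENTS 0x0040u

/* Hypothetical subset of the superblock: the three feature bitmaps. */
struct sketch_super {
    uint32_t feature_compat;     /* safe to ignore if unknown */
    uint32_t feature_ro_compat;  /* unknown bits: read-only mount at most */
    uint32_t feature_incompat;   /* unknown bits: mount must be refused */
};

enum sketch_mount_result {
    SKETCH_MOUNT_RW,
    SKETCH_MOUNT_RO_ONLY,
    SKETCH_MOUNT_REFUSED
};

/* Decide what a kernel that knows only the given feature bits may do. */
enum sketch_mount_result
sketch_check_features(const struct sketch_super *sb,
                      uint32_t known_ro_compat, uint32_t known_incompat)
{
    assert(sb != NULL);

    if (sb->feature_incompat & ~known_incompat)
        return SKETCH_MOUNT_REFUSED;
    if (sb->feature_ro_compat & ~known_ro_compat)
        return SKETCH_MOUNT_RO_ONLY;
    return SKETCH_MOUNT_RW;
}
```

Under this scheme a filesystem that has started using extents carries an
incompatible bit, so a kernel built before that feature existed declines to
mount it rather than misreading the block maps.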

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:39                         ` Gerrit Huizenga
@ 2006-06-09 19:45                           ` Jeff Garzik
  2006-06-09 20:38                             ` Gerrit Huizenga
  2006-06-10 10:03                           ` Christoph Hellwig
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:45 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
>> PRECISELY.  So you should stop modifying a filesystem whose design is 
>> admittedly _not_ modern!
> 
> So just how long do you think it would take to get a modern filesystem
> into the hands of real users, supported by the distros?  From community
> building, through design, development, testing, delivery?

Start from a known working point, and keep it working...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:51                       ` Jeff Garzik
  2006-06-09 19:39                         ` Gerrit Huizenga
@ 2006-06-09 19:49                         ` Theodore Tso
  2006-06-09 20:04                           ` Jeff Garzik
  2006-06-11 16:02                         ` Arjan van de Ven
  2006-06-12 22:06                         ` [Ext2-devel] " Pavel Machek
  3 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 19:49 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, Jun 09, 2006 at 02:51:55PM -0400, Jeff Garzik wrote:
> ext3 is already essentially xiafs-on-life-support, when you consider 
> today's large storage systems and today's filesystem technology.  Just 
> look at the ugly hacks needed to support expanding an ext3 filesystem 
> online.

And what ugly hacks are you talking about?  It's actually quite clean;
with the latest e2fsprogs, you use the same command (resize2fs) for
doing both online and offline resizing.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:55                 ` Jeff Garzik
  2006-06-09 17:12                   ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 19:57                   ` Theodore Tso
  2006-06-09 20:09                     ` Jeff Garzik
                                       ` (2 more replies)
  2006-06-10  0:07                   ` Olivier Galibert
  2 siblings, 3 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 19:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
> That is what the entirety of Linux development is -- step-by-step.
> 
> It is OBVIOUS that it would take five minutes to start ext4.
> 
> 1) clone a new tree
> 2) cp -a fs/ext3 fs/ext4
> 3) apply extent and 48bit patches
> 4) apply related e2fsprogs patches
> 
> Then update ext4 step-by-step, using the normal Linux development process.

We don't do this with the SCSI layer where we make a complete clone of
the driver layer so that there is a /usr/src/linux/driver/scsi and
/usr/src/linux/driver/scsi2, do we?  And we didn't do that with the
networking layer either, as we added ipsec, ipv6, softnet, and a whole
host of other changes and improvements.  

What we do instead is we have a series of patches, which can be made
available in various experimental trees, and as they get more
polishing and experience with people using it without any problems,
they can get merged into the -mm tree, and then eventually, when they
are deemed ready, into mainline.  That is also the normal Linux
development process, and it's worked quite well up until now with ext3.

Folks seem to be worried about ext3 being "too important to experiment
with", but the fact remains, we've been doing continuous improvement
with ext3 for quite some time, and it's been quite smooth.  The htree
introduction was essentially completely painless, for example --- and
people liked the fact that they could get the features of indexed
directories without needing to do a complete dump and restore of the
filesystem.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:42           ` [Ext2-devel] " Gerrit Huizenga
@ 2006-06-09 20:00             ` Jeff Garzik
  2006-06-09 20:08               ` Alex Tomas
  2006-06-09 20:35               ` Theodore Tso
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:00 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Michael Poole, Andrew Morton, ext2-devel, linux-kernel,
	Christoph Hellwig, cmm, linux-fsdevel

Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 14:55:56 EDT, Jeff Garzik wrote:
>> Because it's called backwards compat, when it isn't?
>> Because it is very difficult to find out which set of kernels you are 
>> locked out of?
>> Because the filesystem upgrade is stealthy, occurring as it does on the 
>> first data write?
> 
> Actually, the *only* point being contended here is running older
> kernels on some newer filesystems (created originally with a newer
> kernel), right?
> 
> Or do you have examples of where current kernels could not deal
> with an ext3 feature at some point in time?
> 
> I would argue that 0.001% of all Linux *users* actually worry about
> this - most of them are right here on the development mailing list.
> So, that group is more vocal, for sure.  But, if it works for 99.99+%
> users, aren't we still on the good path, from the point of view of
> those people who actually *use* Linux the most?

The overall objection is to treating ext3 as a highly mutable, 
one-size-fits-all filesystem.

Maybe there is value in moving some reiser4 concepts -- a set of 
metadata+algorithm plugins -- to the VFS level.  I dunno.

But for ext3 specifically, it seems like bolting on extents, 48bit, 
delayed allocation, and other new features weren't really suited for the 
original ext2-style design.  Outside of the support (and marketing, 
because that's all version numbers are in the end) issues already 
mentioned, I think it falls into the nebulous realm of "taste."

Rather than taking another decade to slowly fix ext2 design decisions, 
why not move the process along a bit more rapidly?  Release early, 
release often...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:49                         ` [Ext2-devel] " Theodore Tso
@ 2006-06-09 20:04                           ` Jeff Garzik
  2006-06-09 20:57                             ` Stephen C. Tweedie
  2006-06-09 22:37                             ` [Ext2-devel] " Andreas Dilger
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:04 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Matthew Frost, Alex Tomas,
	Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
	linux-fsdevel

Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 02:51:55PM -0400, Jeff Garzik wrote:
>> ext3 is already essentially xiafs-on-life-support, when you consider 
>> today's large storage systems and today's filesystem technology.  Just 
>> look at the ugly hacks needed to support expanding an ext3 filesystem 
>> online.
> 
> And what ugly hacks are you talking about?  It's actually quite clean;
> with the latest e2fsprogs, you use the same command (resize2fs) for
> doing both online and offline resizing.

Consider a blkdev of size S1.  Using LVM we increase that value under 
the hood to size S2, where S2 > S1.  We perform an online resize from 
size S1 to S2.  The size and alignment of any new groups added will 
differ from the non-resize case, where mke2fs was run directly on a 
blkdev of size S2.

	Jeff
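
[Editor's sketch, not part of the original message.] The block-group
arithmetic behind growing from S1 to S2 can be written out as below. It
assumes the usual ext2/ext3 rule that one bitmap block tracks 8 * block_size
blocks per group, and a first data block of 0 (as with 4 KiB blocks). The
function names are hypothetical. It does not model reserved descriptor blocks
or metadata placement, so it shows only how many groups a resize appends, not
whether their layout differs from a fresh mke2fs.

```c
#include <assert.h>
#include <stdint.h>

/* One block bitmap of block_size bytes tracks 8 * block_size blocks. */
uint64_t sketch_blocks_per_group(uint32_t block_size)
{
    return 8ull * block_size;
}

/* Number of block groups needed to cover fs_blocks (last one may be partial). */
uint64_t sketch_group_count(uint64_t fs_blocks, uint32_t block_size)
{
    uint64_t per_group = sketch_blocks_per_group(block_size);
    assert(per_group != 0);
    return (fs_blocks + per_group - 1) / per_group;
}

/* Number of blocks in the final group; assumes fs_blocks > 0. */
uint64_t sketch_last_group_blocks(uint64_t fs_blocks, uint32_t block_size)
{
    uint64_t per_group = sketch_blocks_per_group(block_size);
    assert(per_group != 0 && fs_blocks > 0);
    uint64_t rem = fs_blocks % per_group;
    return rem ? rem : per_group;
}
```

For example, with 4 KiB blocks a device of 100000 blocks needs 4 groups, the
last holding 1696 blocks. Growing it to 200000 blocks fills that last group
and appends 3 more, for 7 in total.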

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:00             ` Jeff Garzik
@ 2006-06-09 20:08               ` Alex Tomas
  2006-06-09 20:10                 ` [Ext2-devel] " Jeff Garzik
  2006-06-09 20:35               ` Theodore Tso
  1 sibling, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 20:08 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
	Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel

>>>>> Jeff Garzik (JG) writes:

 JG> Rather than taking another decade to slowly fix ext2 design decisions, 
 JG> why not move the process along a bit more rapidly?  Release early, 
 JG> release often...

that could be true, if we were talking about something yet to be
designed, coded and tested.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:57                   ` Theodore Tso
@ 2006-06-09 20:09                     ` Jeff Garzik
  2006-06-09 20:14                       ` Alex Tomas
  2006-06-09 20:38                     ` Joel Becker
  2006-06-12  8:58                     ` Jes Sorensen
  2 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:09 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Theodore Tso wrote:
> We don't do this with the SCSI layer where we make a complete clone of
> the driver layer so that there is a /usr/src/linux/driver/scsi and
> /usr/src/linux/driver/scsi2, do we?  And we didn't do that with the
> networking layer either, as we added ipsec, ipv6, softnet, and a whole
> host of other changes and improvements.  
> 
> What we do instead is we have a series of patches, which can be made
> available in various experimental trees, and as they get more
> polishing and experience with people using it without any problems,
> they can get merged into the -mm tree, and then eventually, when they
> are deemed ready, into mainline.  That is also the normal Linux
> development process, and it's worked quite well up until now with ext3.

No, there is a key difference between ext3 and SCSI/etc.:  cruft is removed.

In ext3, old formats are supported for all eternity.


> Folks seem to be worried about ext3 being "too important to experiment
> with", but the fact remains, we've been doing continuous improvement
> with ext3 for quite some time, and it's been quite smooth.  The htree
> introduction was essentially completely painless, for example --- and

I disagree.  There were some distro annoyances as I recall.


> people liked the fact that they could get the features of indexed
> directories without needing to do a complete dump and restore of the
> filesystem.

Of course people always like new features.  :)

ext4 should allow you to deliver new features more rapidly, while 
keeping the existing ext3 happily stable.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:08               ` Alex Tomas
@ 2006-06-09 20:10                 ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:10 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Gerrit Huizenga, Andrew Morton, ext2-devel, linux-kernel,
	Michael Poole, Christoph Hellwig, cmm, linux-fsdevel

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> Rather than taking another decade to slowly fix ext2 design decisions, 
>  JG> why not move the process along a bit more rapidly?  Release early, 
>  JG> release often...
> 
> that could be true, if we were talking about something yet to be
> designed, coded and tested.

'cp ext3 ext4' already has its first two features:  extents and 48bit. 
And it works today.  Tested to the extent that the submitter has tested it.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:09                     ` Jeff Garzik
@ 2006-06-09 20:14                       ` Alex Tomas
  2006-06-09 20:28                         ` Jeff Garzik
  2006-06-19  7:48                         ` [Ext2-devel] " Helge Hafting
  0 siblings, 2 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 20:14 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Theodore Tso, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> Jeff Garzik (JG) writes:

 JG> No, there is a key difference between ext3 and SCSI/etc.:  cruft is removed.

 JG> In ext3, old formats are supported for all eternity.

we'd need this anyway, just to let users migrate.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:38                       ` Linus Torvalds
  2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
  2006-06-09 19:22                         ` Alex Tomas
@ 2006-06-09 20:16                         ` Andreas Dilger
  2006-06-09 20:31                           ` Linus Torvalds
  2006-06-09 20:31                           ` Jeff Garzik
  2 siblings, 2 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas

On Jun 09, 2006  11:38 -0700, Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Alex Tomas wrote:
> > would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
> 
> Let's put it this way:
>  - have you had _any_ valid argument at all against "ext4"?
> 
> Think about it. Honestly. Tell me anything that doesn't work?

It's funny that everyone is arguing to fork ext3 into ext4, for a feature
that will primarily allow it to work with large disks (that are already
here, not some wacky pipe dream of featuritis as Jeff thinks).  Yet the
same people that are advocating code duplication on a massive scale in
ext4 are against 5 lines of duplication between the VFS and a filesystem,
or in a couple of drivers here and there.

Having two copies of ext3 means we immediately get 2x the bugs, and
no guarantee that they will ever be fixed in ext3 (all of the ext3
maintainers will be solidly on the ext4 bandwagon if it comes to that).
It also means that two virtually identical copies of the same code
will be in memory at the same time (one for ext3 and another for ext4)
polluting the cache, even though some developers complain that a single
EXPORT_SYMBOL is "bloating" the kernel.  This also means two inode slabs
causing memory fragmentation, etc.

The other issue is that adding a new "ext4" filesystem type will cause
userspace tools to break that assume they know something about the
filesystem type.  They will all detect the filesystem as "ext3" and try
to mount it as such, when the required kernel filesystem is ext4.  Or
we will need to have "mkfs.ext4", "fsck.ext4", etc, for no particular
reason.

Either a system upgrades totally to ext4 to avoid the duplication of code
in memory (and breaks ALL backward compatibility, for no good reason), or
it lives with "only mount filesystems as 'ext4' when they need it" and the
code will rarely be used.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:16                       ` Jeff Garzik
@ 2006-06-09 20:27                         ` Chase Venters
  0 siblings, 0 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 20:27 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, adilger, torvalds, alex, ext2-devel, linux-kernel,
	cmm, linux-fsdevel

On Fri, 9 Jun 2006, Jeff Garzik wrote:

> I disagree with the "years to stabilize ext4" argument, because we are 
> starting from a known good point.  I think ext4 will be easier to maintain 
> and tune for modern storage systems, if we don't have to worry as much about 
> that stuff for ext3.

Let's say we

# cp ext3 ext4
# cat extents 48bit | patch

and then roll it out in 2.6.18. That in and of itself is probably fine and 
stable (though it's no different than ext3 except for the name and the two 
new additions).

But are you going to do this again for ext5 when more features come along? 
Or are you going to warn ext4 users that the FS is not expected to be stable?

If you do the latter, be prepared for people to be wary of using it for a 
long while. The difference is between actual and perceived stability.

To put a finer point on it - I've got a system that's been running 
flawlessly for years on 2.5.3. It's actually been stable - never had any 
sort of crashing problem at all. But I'm essentially crazy for running 
that kernel. At the time I installed it, it certainly wasn't perceived as 
stable. If the computer in question were any more than a file server / 
iptables box for my home, I'd have said "well, hell, I think I'm going to 
have to do without 2.5 so that I can have something trustworthy."

(Amusingly enough, I started assembling a replacement for it recently, 
if only to have something newer and more capable. Having gone from 
Slackware to Gentoo I decided to give the April stable 
Debian release a whirl. Imagine my shock and awe when I watched Debian 
boot into a 2.4 kernel :P)

> 	Jeff
>

Cheers,
Chase

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:14                       ` Alex Tomas
@ 2006-06-09 20:28                         ` Jeff Garzik
  2006-06-19  7:48                         ` [Ext2-devel] " Helge Hafting
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:28 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Theodore Tso, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
> 
>  JG> No, there is a key difference between ext3 and SCSI/etc.:  cruft is removed.
> 
>  JG> In ext3, old formats are supported for all eternity.
> 
> we'd need this anyway, just to let users migrate.

No, ext4 should remove some of the crufty old back-compat code.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:16                         ` Andreas Dilger
@ 2006-06-09 20:31                           ` Linus Torvalds
  2006-06-09 20:31                           ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 20:31 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Alex Tomas



On Fri, 9 Jun 2006, Andreas Dilger wrote:
> 
> It's funny that everyone is arguing to fork ext3 into ext4, for a feature
> that will primarily allow it to work with large disks (that are already
> here, not some wacky pipe dream of featuritis as Jeff thinks).  Yet the
> same people that are advocating code duplication on a massive scale in
> ext4 are against 5 lines of duplication between the VFS and a filesystem,
> or in a couple of drivers here and there.

You haven't actually listened to any of the arguments, have you?

Please remove me from the Cc on this thread. I'm not interested. 

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:16                         ` Andreas Dilger
  2006-06-09 20:31                           ` Linus Torvalds
@ 2006-06-09 20:31                           ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:31 UTC (permalink / raw)
  To: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

Andreas Dilger wrote:
> The other issue is that adding a new "ext4" filesystem type will cause
> userspace tools to break that assume they know something about the
> filesystem type.  They will all detect the filesystem as "ext3" and try
> to mount it as such, when the required kernel filesystem is ext4.  Or
> we will need to have "mkfs.ext4", "fsck.ext4", etc, for no particular
> reason.

Yes, you want those tools, and you want to call the filesystem ext4. 
Otherwise you'll never break free of the existing metadata formats 
(which are apparently changing over time _anyway_).


> Either a system upgrades totally to ext4 to avoid the duplication of code
> in memory (and breaks ALL backward compatibility, for no good reason), or

Correct.  You must upgrade totally to ext4.

And this happens ANYWAY once extents/etc. are enabled.  It's an upgrade.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:08     ` Jeff Garzik
  2006-06-09 15:25       ` Jeff Garzik
  2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 20:32       ` Stephen C. Tweedie
  2006-06-09 20:46         ` Linus Torvalds
  2 siblings, 1 reply; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:32 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
	linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
	Andreas Dilger

Hi,

On Fri, 2006-06-09 at 11:08 -0400, Jeff Garzik wrote:

> Stuffing more and more features into fs/ext3 means you are following the 
> path that leads to reiser4...  where EVERYTHING under the hood is 
> mutable, all within fs/ext3.

> Why do you insist upon calling the end result ext3, when the truth is 
> that you are slowly rewriting ext3?

The trouble is, does it make sense to do otherwise?

Should large file support have resulted in ext4?  ACLs/xattrs, ext5?
Htree, ext6?  Online resize, ext7?  Yes, let's make it ext8 for extents!

> Here's a key question for ext3 developers, which I bet has no answer: 
> when is it enough?

When is the Linux syscall interface enough?  When should we just bump it
and cut out all the compatibility interfaces?

No, we don't; we let people configure certain obsolete bits out (a.out
support etc), but we keep it in the tree despite the indirection cost to
maintain multiple interfaces etc.

> > While this is partly true, one of the big benefits is that you can
> > transparently upgrade your system to use the new features and improve
> > performance without a long outage window.  Having a completely separate
> 
> Changing the name to ext4 doesn't erase this capability.

The name is irrelevant here.  FWIW, something we've considered is to
make the user visibility of a batch of new features more obvious by
labelling them "ext4", so "mke4fs" would automatically enable those
features and the filesystem could register "ext4" as an fs type in the
kernel.

But that could be done without forking the codebase.  It would just be a
matter of binding feature flag sets to the given name.  What you're
talking about is forking the codebase itself, and I don't see the need
for that right now.

> > ext4 filesystem doesn't improve the compatibility story at all.  There
> > has been renewed discussion on implementing "mounting ext3 without a
> > journal", just for a recovery mode, because ext2 will not be modified
> > to get all of these features (running e2fsck on a huge filesystem each
> > reboot would be insane).
> 
> So now you are going backwards, and implementing ext2-within-ext3?

No, it would be a readonly emergency mode, not writable ext2 at all.

> Are you ready to admit, yet, that ext3 is 100% mutable in the minds of 
> ext3 developers?

The kernel syscall interface is 100% mutable by the same criteria.
Except in each case it's not "mutable", it's "extensible", which is a
*far* different thing.

> If all the ext3 developers are on board, that just implies that there is 
> no clear definition of what "ext3" really means.  With this patch 
> series, and with future plans described here and elsewhere, the name 
> "ext3" will become more and more meaningless.

Does the continuing addition of futexes, inotify,
$FAVOURITE_FEATURE_OF_THE_DAY mean that "Linux" is more and more
meaningless?  I fail to see
much difference.  An application coded for linux-2.0's public interfaces
will, for the most part, if we do our jobs right, continue to work on
2.6.  An application coded for 2.6, expecting to use AIO, large files,
futexes and NPTL, will definitely not run on 2.0.  The incremental
extension of ext3 doesn't seem to be a fundamentally different concept.

Backwards compatibility of the kernel ABI is considered important; so in
ext3, the developers have a high regard for backwards compatibility of
on-disk data.  Personally I see that as an asset, not a problem; indeed,
it was the single most important design criterion from the outset.

--Stephen

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:00             ` Jeff Garzik
  2006-06-09 20:08               ` Alex Tomas
@ 2006-06-09 20:35               ` Theodore Tso
  2006-06-09 21:41                 ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 20:35 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Gerrit Huizenga, Michael Poole, Andrew Morton, ext2-devel,
	linux-kernel, Christoph Hellwig, cmm, linux-fsdevel

On Fri, Jun 09, 2006 at 04:00:44PM -0400, Jeff Garzik wrote:
> But for ext3 specifically, it seems like bolting on extents, 48bit, 
> delayed allocation, and other new features weren't really suited for the 
> original ext2-style design.  Outside of the support (and marketing, 
> because that's all version numbers are in the end) issues already 
> mentioned, I think it falls into the nebulous realm of "taste."

If it is very much a matter of taste, why are you trying to dictate to
the ext2 developers how they choose to do things?  As long as it
works, and we haven't screwed up yet, I'd argue this falls into the
category of letting each subsystem decide how they best work.  The way
DaveM and the networking team works is quite different from how the
SCSI developers work or the XFS team works --- it's not a
one-size-fits-all sort of thing.

And I'd also dispute your "weren't really suited for the original
ext2-style design" comment.  Ext2/3 was always designed to be
extensible from the start, and we've been adding features
successfully for quite a while.

> Rather than taking another decade to slowly fix ext2 design decisions, 
> why not move the process along a bit more rapidly?  Release early, 
> release often...

I don't think it will be another decade, but yes, regardless of
whether we do a code fork or not, it will take time.  Basically, you
and the ext2 developers have a disagreement about whether or not a
code fork will actually move the process along more quickly or not.
Either way, we will be releasing early and often, so people can test
it out and comment on it.  Releasing patches to LKML is just the first
step in this process.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:57                   ` Theodore Tso
  2006-06-09 20:09                     ` Jeff Garzik
@ 2006-06-09 20:38                     ` Joel Becker
  2006-06-09 20:50                       ` Dave Jones
  2006-06-09 21:03                       ` Theodore Tso
  2006-06-12  8:58                     ` Jes Sorensen
  2 siblings, 2 replies; 296+ messages in thread
From: Joel Becker @ 2006-06-09 20:38 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 03:57:50PM -0400, Theodore Tso wrote:
> We don't do this with the SCSI layer where we make a complete clone of
> the driver layer so that there is a /usr/src/linux/driver/scsi and
> /usr/src/linux/driver/scsi2, do we?  And we didn't do that with the
> networking layer either, as we added ipsec, ipv6, softnet, and a whole
> host of other changes and improvements.  

Ted,
	We don't have any permanent, physical representation of the
state either.  With a filesystem we do.  I don't care how many changes
you made to the SCSI stack.  The code from a year ago could be entirely
different.  However, if the old stack and the new stack both support
card X, then it Just Works.  The Adaptec driver is a case in point.
When the new driver was still flaky, folks and distros could select the
old driver with impunity.  Running the new driver didn't fundamentally
change your Adaptec card so you couldn't run the old one.
	Filesystem features are different.  There is a permanent state
that the older code cannot read.  Alex claims people just shouldn't use
"-o extents", but the fact is their distro will choose it for them.  We
have multiboot machines in our test lab, because like many people we
don't have unlimited funds.  What happened when we installed the 2.6
distros?  All of a sudden the older 2.4 distros wouldn't mount the
shared filesystems, because of ext3 features.  This wasn't the kernel
driver, this was merely the tools!  Surprise!  We made no choice to use
new features, and they were thrust upon us.  This will happen to others.

Joel

-- 

"Sometimes one pays most for the things one gets for nothing."
        - Albert Einstein

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:45                           ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 20:38                             ` Gerrit Huizenga
  0 siblings, 0 replies; 296+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 20:38 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

On Fri, 09 Jun 2006 15:45:16 EDT, Jeff Garzik wrote:
> Gerrit Huizenga wrote:
> > On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
> >> PRECISELY.  So you should stop modifying a filesystem whose design is 
> >> admittedly _not_ modern!
> > 
> > So just how long do you think it would take to get a modern filesystem
> > into the hands of real users, supported by the distros?  From community
> > building, through design, development, testing, delivery?
> 
> Start from a known working point, and keep it working...

Then clone all the user level packages, work with distros to get
the new packages included, update the man pages, get those included,
make sure bug fixes for ext2 get propagated to ext4 - oh, and those
for ext3 as well.  And then work with mainline to decide when to
change from EXPERIMENTAL to stable, then decide how to get enough
users to make sure the testing is good enough, then work with the
distros to enable, then work with them to agree to provide support
to their most important, biggest, highest risk customers with this
new filesystem used by only 20 people because it isn't the default.

Then repeat this whole discussion with each new feature proposed for
ext4 over the next 5 years, watch developers get disillusioned yet
again, watch 4 new competing filesystems pop up and try to be the
next great filesystem.  Watch them all fade away as the ultimate
battle for mindshare wears them out and the ever cascading war between
stability and support versus new features brings us back to where we
are again today.

Or just add the feature that the entire ext3 development community
thinks is stable enough to move forward, is well enough integrated
with the existing code to *not* be a bolt on, and is incrementally
small enough to be managed by its very own developer community
without the overhead of splitting that community even further.

The short words sound good but in reality we should all have lived
through this long enough to know better.

gerrit

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:47           ` Jeff Garzik
  2006-06-09 15:55             ` Alex Tomas
  2006-06-09 16:01             ` Linus Torvalds
@ 2006-06-09 20:38             ` Stephen C. Tweedie
  2 siblings, 0 replies; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:38 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
	linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
	Andreas Dilger

Hi,

On Fri, 2006-06-09 at 11:47 -0400, Jeff Garzik wrote:

> think about The Experience:  Suddenly users that could use 2.4.x and 
> 2.6.x are locked into 2.6.18+, by the simple and common act of writing 
> to a file.

No.

The default is --- user writes to file on 2.6.18+, goes back to 2.4, and
everything still keeps on working just fine.  

Or, user says "I *specifically* request this feature that I *know* is
not compatible with older kernels", and then they get just that.
Extents are not going to be on by default.  Please, we've got more sense
than that!

Just like the developer who says "I *specifically* code for this fancy
new vmsplice syscall" gets exactly the same.

--Stephen

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:59                     ` Andrew Morton
  2006-06-09 19:16                       ` Jeff Garzik
@ 2006-06-09 20:44                       ` Alan Cox
  2006-06-11 15:52                         ` [Ext2-devel] " Arjan van de Ven
  1 sibling, 1 reply; 296+ messages in thread
From: Alan Cox @ 2006-06-09 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, ext2-devel, linux-kernel, torvalds, cmm,
	linux-fsdevel, alex, adilger

Ar Gwe, 2006-06-09 am 11:59 -0700, ysgrifennodd Andrew Morton:
> Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
> ago, and probably with RH's 2.2-based implementation.  

If your files are under 2GB long, you've not used any attributes,
SELinux labels or various other things maybe. In the practical real
world case it isn't. I doubt many Fedora/Red Hat users have a single FS
from RHEL4/FC1 onwards that is readable by 2.2 ext3 (or most 2.4 ext3)

OTOH the number of complaints about this is minimal, people want to go
forwards in a controlled manner not backwards.

Alan

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:35                                 ` Alex Tomas
  2006-06-09 19:35                                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 20:44                                   ` Joel Becker
  2006-06-09 20:49                                     ` Alex Tomas
  2006-06-11 20:14                                   ` [Ext2-devel] " grundig
  2 siblings, 1 reply; 296+ messages in thread
From: Joel Becker @ 2006-06-09 20:44 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
	Alan Cox

On Fri, Jun 09, 2006 at 11:35:43PM +0400, Alex Tomas wrote:
> that's your point of view. mine is that this option (and code)
> to be used only when needed. 

	Which is irrelevant.  If you tell the world "extents are
better!", they're going to turn them on regardless of whether you
consider their situation a good candidate.  Many non-kernel-hackers
started using reiserfs before it was usably stable, just because
"journaling is better!"

Joel

-- 

Life's Little Instruction Book #347

	"Never waste the opportunity to tell someone you love them."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:32       ` Stephen C. Tweedie
@ 2006-06-09 20:46         ` Linus Torvalds
  2006-06-09 20:56           ` Alex Tomas
  2006-06-20  6:15           ` [Ext2-devel] " Qi Yong
  0 siblings, 2 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 20:46 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Jeff Garzik, ext2-devel@lists.sourceforge.net,
	linux-kernel, Mingming Cao, linux-fsdevel, Andreas Dilger



On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
> 
> When is the Linux syscall interface enough?  When should we just bump it
> and cut out all the compatibility interfaces?
> 
> No, we don't; we let people configure certain obsolete bits out (a.out
> support etc), but we keep it in the tree despite the indirection cost to
> maintain multiple interfaces etc.

Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT 
MIGHT BREAK OLD USERS.

Your point was exactly what?

Btw, where did that 2TB limit number come from? Afaik, it should be 16TB 
for a 4kB filesystem, no?
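
The size limits quoted in this thread follow from the width of the block number and the block size. A minimal sketch of that arithmetic, assuming byte counts fit in 64 bits; the helper name `max_fs_bytes` is hypothetical and not taken from the patch series:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest filesystem size, in bytes, that can be addressed with
 * `bits`-bit block numbers and `block_size`-byte blocks.
 * Hypothetical helper for illustration only; the caller must keep
 * the result below 2^64.
 */
static uint64_t max_fs_bytes(unsigned bits, uint64_t block_size)
{
	assert(bits < 64);
	return ((uint64_t)1 << bits) * block_size;
}
```

With 4 KiB blocks, 32 bits give 16 TiB, 31 bits (as if block numbers were held in a signed 32-bit type, which may be where the 8TB figure in the cover letter comes from) give 8 TiB, and 48 bits give 1024 PiB.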

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:44                                   ` Joel Becker
@ 2006-06-09 20:49                                     ` Alex Tomas
  2006-06-09 21:11                                       ` Joel Becker
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 20:49 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
	Alan Cox

>>>>> Joel Becker (JB) writes:

 JB> On Fri, Jun 09, 2006 at 11:35:43PM +0400, Alex Tomas wrote:
 >> that's your point of view. mine is that this option (and code)
 >> to be used only when needed. 

 JB> 	Which is irrelevant.  If you tell the world "extents are
 JB> better!", they're going to turn them on regardless of whether you
 JB> consider their situation a good candidate.  Many non-kernel-hackers
 JB> started using reiserfs before it was usably stable, just because
 JB> "journaling is better!"

I haven't said that so far. I feel absolutely comfortable to put
as many warnings as needed from your point of view.

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:38                     ` Joel Becker
@ 2006-06-09 20:50                       ` Dave Jones
  2006-06-09 21:09                         ` Joel Becker
  2006-06-09 21:32                         ` [Ext2-devel] " Jeff Garzik
  2006-06-09 21:03                       ` Theodore Tso
  1 sibling, 2 replies; 296+ messages in thread
From: Dave Jones @ 2006-06-09 20:50 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
 > that the older code cannot read.  Alex claims people just shouldn't use
 > "-o extents", but the fact is their distro will choose it for them.

.. on partitions over a certain size, which couldn't be read with
older ext3 filesystems _anyway_

Enabling it by default on partitions of a size less than those
that need extents seems to be somewhat pointless to me?

Am I missing something fundamental that precludes the use of both
extent-based and current existing filesystems from the same code
simultaneously ?

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:56               ` Jeff Garzik
  2006-06-09 16:07                 ` Alex Tomas
@ 2006-06-09 20:52                 ` Stephen C. Tweedie
  2006-06-09 21:47                   ` [Ext2-devel] " Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:52 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
	linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
	Alex Tomas, Andreas Dilger

Hi,

On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:

> Think about how this will be deployed in production, long term.
> 
> If extents are not made default at some point, then no one will use the 
> feature, and it should not be merged.

Features such as ACLs and SELinux are still not on by default and are
most *definitely* used.  This is a bogus argument.

> And when extents are default, you have this blizzard-of-feature-flags 
> stealth upgrade event occur _sometime_ after they boot into the new fs 
> for the first time.

No.  I don't see it ever being forced on in the kernel by default, so
there will be no such "stealth upgrades".

Rather, if it is "made default", that will be done by setting the flag
by default on newly-created filesystems in mke2fs.  We won't be playing
magic on existing filesystems.

And to avoid confusion, I am *entirely* open to the idea of making it
only ever default to on in mke2fs at some point in the future where we
batch a set of incompat features with the "ext4" label, so that "mke2fs
-O ext4", or "mke4fs", would set it.  That has already been proposed on
ext2-devel; we're nowhere near the stage of making that default yet.

--Stephen

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:46         ` Linus Torvalds
@ 2006-06-09 20:56           ` Alex Tomas
  2006-06-20  6:15           ` [Ext2-devel] " Qi Yong
  1 sibling, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 20:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, Andreas Dilger

>>>>> Linus Torvalds (LT) writes:

 LT> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
 LT> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB 
 LT> for a 4kB filesystem, no?

2TB => 16K group descriptors * 8 (sizeof(void*) on 64bit arch) => 128K -- slab limit

we have a fix for this as well.
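
The arithmetic above can be checked with a short sketch. It assumes the usual ext2/ext3 default of one bitmap block per group (so 8 * block size blocks per group) and one pointer per group, as the message states; the function names are hypothetical and do not come from the ext3 sources:

```c
#include <assert.h>
#include <stdint.h>

/* Blocks covered by one group: one bit per block in a single bitmap block. */
static uint64_t blocks_per_group(uint64_t block_size)
{
	return 8 * block_size;
}

/* Number of block groups needed for a filesystem of fs_bytes bytes. */
static uint64_t group_count(uint64_t fs_bytes, uint64_t block_size)
{
	uint64_t blocks, bpg;

	assert(block_size > 0);
	blocks = fs_bytes / block_size;
	bpg = blocks_per_group(block_size);
	return (blocks + bpg - 1) / bpg;	/* round up: last group may be short */
}

/* Size of an in-memory table holding one pointer per group. */
static uint64_t pointer_table_bytes(uint64_t groups, uint64_t pointer_size)
{
	return groups * pointer_size;
}
```

For 2 TiB with 4 KiB blocks this gives 16384 groups, and 16384 pointers of 8 bytes each come to 128 KiB, the figure given above as the slab limit.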

thanks, Alex

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:04                           ` Jeff Garzik
@ 2006-06-09 20:57                             ` Stephen C. Tweedie
  2006-06-09 21:49                               ` Jeff Garzik
  2006-06-09 22:37                             ` [Ext2-devel] " Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:57 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Theodore Ts'o, Matthew Frost, Stephen Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

Hi,

On Fri, 2006-06-09 at 16:04 -0400, Jeff Garzik wrote:

> Consider a blkdev of size S1.  Using LVM we increase that value under 
> the hood to size S2, where S2 > S1.  We perform an online resize from 
> size S1 to S2.  The size and alignment of any new groups added will be
> different from the non-resize case, where mke2fs was run directly on a 
> blkdev of size S2.

No, they won't.  We simply grow the last block group in the filesystem
up to the size where we'd naturally add another block group anyway; and
then, we add another block group exactly where it would have been on a
fresh mkfs.
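
A minimal sketch of why the two layouts agree, assuming each group's start depends only on its index and a fixed blocks-per-group value. The names are hypothetical, and corner cases such as mke2fs dropping a too-small final group are ignored:

```c
#include <assert.h>
#include <stdint.h>

/* First block of group g.  Note that it does not depend on the fs size. */
static uint64_t group_start(uint64_t g, uint64_t first_data_block,
			    uint64_t bpg)
{
	return first_data_block + g * bpg;
}

/*
 * Number of blocks in group g for a filesystem of total_blocks blocks,
 * or 0 if that group does not exist at this size.
 */
static uint64_t group_len(uint64_t g, uint64_t total_blocks,
			  uint64_t first_data_block, uint64_t bpg)
{
	uint64_t start, remaining;

	assert(bpg > 0);
	start = group_start(g, first_data_block, bpg);
	if (start >= total_blocks)
		return 0;
	remaining = total_blocks - start;
	return remaining < bpg ? remaining : bpg;
}
```

Growing from S1 to S2 only changes the answer of `group_len()` for the old last group (which fills up) and for groups past it (which appear); `group_start()` is the same for a resized filesystem and for a fresh one of size S2.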

--Stephen

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:38                     ` Joel Becker
  2006-06-09 20:50                       ` Dave Jones
@ 2006-06-09 21:03                       ` Theodore Tso
  2006-06-09 21:24                         ` Joel Becker
  2006-06-09 23:48                         ` Jeff Garzik
  1 sibling, 2 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 21:03 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> 	Filesystem features are different.  There is a permanent state
> that the older code cannot read.  Alex claims people just shouldn't use
> "-o extents", but the fact is their distro will choose it for them.  We
> have multiboot machines in our test lab, because like many people we
> don't have unlimited funds.  What happened when we installed the 2.6
> distros?  All of a sudden the older 2.4 distros wouldn't mount the
> shared filesystems, because of ext3 features.  

This is going to happen regardless of whether we call the code base
"ext3" or "ext4".  Anytime you make format changes (in this case, to
support larger disk sizes) older kernels won't support it any more.
Surprise!  

In the case you were referring to, one specific distribution, Red Hat,
silently added the extended attributes feature to the filesystem
because it was needed by SELINUX.  This was actually a backwards
compatible feature, so that older 2.4 based distributions could
*mount* the filesystem.  Unfortunately e2fsck needs to be more
careful, and so the problem was that the older distribution's fsck
wasn't able to check the filesystem.  But this was actually Red Hat's
fault, in that they shouldn't have transparently added the extended
attribute feature without first asking the user's permission.   
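
The difference between "can mount" and "can fsck" can be sketched as follows. This is an illustration of the compat / ro_compat / incompat scheme under stated assumptions, not the kernel's or e2fsprogs' actual code: the struct, enum, and function names are hypothetical, and the flag values are believed to match the ext2/ext3 headers but should be treated as illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values (assumed to match the on-disk definitions). */
#define FEATURE_COMPAT_EXT_ATTR		0x0008u
#define FEATURE_INCOMPAT_FILETYPE	0x0002u
#define FEATURE_INCOMPAT_EXTENTS	0x0040u

struct feature_sets {
	uint32_t compat;	/* unknown bits: safe to mount read-write */
	uint32_t ro_compat;	/* unknown bits: safe to mount read-only  */
	uint32_t incompat;	/* unknown bits: must refuse to mount     */
};

enum mount_result { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSED };

/* What a kernel that understands `known` does with a disk using `disk`. */
static enum mount_result kernel_mount(struct feature_sets disk,
				      struct feature_sets known)
{
	if (disk.incompat & ~known.incompat)
		return MOUNT_REFUSED;
	if (disk.ro_compat & ~known.ro_compat)
		return MOUNT_RO_ONLY;
	return MOUNT_RW;
}

/* A checker has to be stricter: any unknown bit at all means "hands off". */
static int fsck_will_check(struct feature_sets disk,
			   struct feature_sets known)
{
	return !((disk.compat & ~known.compat) |
		 (disk.ro_compat & ~known.ro_compat) |
		 (disk.incompat & ~known.incompat));
}
```

Under this model, a disk with only the extended-attribute compat bit added still mounts read-write on older code while the older checker declines to touch it, which is the multiboot situation described above. A disk with the extents incompat bit set is refused by both.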

Being able to forward upgrade to newer filesystem formats is a good
thing, and has a long history; users don't like to do a backup,
reformat, and restore pass if they can help it.  Heck, Microsoft
Windows even has a way that they can upgrade a FAT filesystem to their
latest NTFSv5 filesystem using a userspace program.  Providing such a
capability is not a bad thing, and in fact it is a good thing.  The
bad thing to do is to do the conversion without first asking the
user's permission (for example, as Windows XP does when you
boot a preinstalled system on a laptop for the first time).

People seem to be arguing that just because a distribution installer
_could_ do a backwards incompatible upgrade without asking
permission first, we must not provide the capability at all, and make
it be the case that the only way to upgrade from ext3 to ext4 is with
a backup, reformat, and restore.  Surely that doesn't make sense!

> This wasn't the kernel driver, this was merely the tools!  Surprise!
> We made no choice to use new features, and they were thrust upon us.
> This will happen to others.

I suspect that Red Hat has learned from that past experience, and
won't be making that mistake again, at least without explicitly
requesting the user's permission.  So how about we trust the
distributions to be a bit more careful this time around?

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:50                       ` Dave Jones
@ 2006-06-09 21:09                         ` Joel Becker
  2006-06-09 21:51                           ` Mike Snitzer
  2006-06-09 21:32                         ` [Ext2-devel] " Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Joel Becker @ 2006-06-09 21:09 UTC (permalink / raw)
  To: Dave Jones, Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
	Andreas Dilger

On Fri, Jun 09, 2006 at 04:50:36PM -0400, Dave Jones wrote:
> On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
>  > that the older code cannot read.  Alex claims people just shouldn't use
>  > "-o extents", but the fact is their distro will choose it for them.
> 
> .. on partitions over a certain size, which couldn't be read with
> older ext3 filesystems _anyway_

	Certainly that would be fine.  Is that what will actually
happen?  Experience says no.  Even if you get it right in your distro,
not all distros will.  Heck, can you promise me that your distro will
provide e2fsprogs updates to its older releases so that multiboot will
continue to work?

Joel

-- 

"Behind every successful man there's a lot of unsuccessful years."
        - Bob Brown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:49                                     ` Alex Tomas
@ 2006-06-09 21:11                                       ` Joel Becker
  2006-06-09 21:20                                         ` Alex Tomas
  0 siblings, 1 reply; 296+ messages in thread
From: Joel Becker @ 2006-06-09 21:11 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
	Alan Cox

On Sat, Jun 10, 2006 at 12:49:54AM +0400, Alex Tomas wrote:
> >>>>> Joel Becker (JB) writes:
>  JB> 	Which is irrelevant.  If you tell the world "extents are
>  JB> better!", they're going to turn them on regardless of whether you
>  JB> consider their situation a good candidate.  Many non-kernel-hackers
>  JB> started using reiserfs before it was usably stable, just because
>  JB> "journaling is better!"
> 
> I haven't said that so far. I feel absolutely comfortable to put
> as many warnings as needed from your point of view.

	When I say "you", I mean the general consensus.  You can scream
"don't do this" as loud as you want, the world might drown you out.  Not
every random person that sees "new extents in ext3" is going to know
that Alex is the authority.  They certainly aren't going to read the
documentation.  They'll read some comment on some website that says "all
you need is '-o extents'!"

Joel


-- 

"A narcissist is someone better looking than you are."  
         - Gore Vidal

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:11                                       ` Joel Becker
@ 2006-06-09 21:20                                         ` Alex Tomas
  2006-06-09 21:29                                           ` Joel Becker
  0 siblings, 1 reply; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 21:20 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox

>>>>> Joel Becker (JB) writes:

 JB> 	When I say "you", I mean the general consensus.  You can scream
 JB> "don't do this" as loud as you want, the world might drown you out.  Not
 JB> every random person that sees "new extents in ext3" is going to know
 JB> that Alex is the authority.  They certainly aren't going to read the
 JB> documentation.  They'll read some comment on some website that says "all
 JB> you need is '-o extents'!"

two points here:
a) warnings should be made visible at mount time,
   something like printk(KERN_CRIT ...)
b) I don't think you're going to fight all crazy people in the world,
   they'll definitely find a way to break something:
   data or something else.

thanks, Alex

PS. in the end, "extents" option affects *new* files only. and one
    can boot extents-enabled kernel and convert fs back.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:03                       ` Theodore Tso
@ 2006-06-09 21:24                         ` Joel Becker
  2006-06-09 21:36                           ` [Ext2-devel] " Chase Venters
  2006-06-09 21:51                           ` Theodore Tso
  2006-06-09 23:48                         ` Jeff Garzik
  1 sibling, 2 replies; 296+ messages in thread
From: Joel Becker @ 2006-06-09 21:24 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 05:03:19PM -0400, Theodore Tso wrote:
> This is going to happen regardless of whether we call the code base
> "ext3" or "ext4".  Anytime you make format changes (in this case, to
> support larger disk sizes) older kernels won't support it any more.
> Surprise!  

	Of course format changes break things.  But if you claim that
"X" and "Y" are the same thing, and they aren't, people won't see it
coming.

> wasn't able to check the filesystem.  But this was actually Red Hat's
> fault, in that they shouldn't have transparently added the extended
> attribute feature without first asking the user's permission.   

	Sure it was Red Hat's fault.  Knowing who to blame doesn't solve
the existing problem, though.  They never even put out e2fsck upgrades
for older distros, which would have solved the problem just as easily.

> Being able to forward upgrade to newer filesystem formats is a good
> thing, and has a long history; users don't like to do a backup,
> reformat, and restore pass if they can help it.  Heck, Microsoft
> Windows even has a way that they can upgrade a FAT filesystem to their
> latest NTFSv5 filesystem using a userspace program.  Providing such a
> capability is not a bad thing, and in fact it is a good thing.  The
> bad thing to do is to do the conversion without first asking the
> user's permission (for example just as Windows XP does when you first
> boot a preinstalled system on a laptop for the first time).

	This entire statement is true.  However, note that "FAT" becomes
"NTFSv5", and there is no expectation, implicit or explicit, that you
can use "FAT" to mount the changed volume.
	You can call the new filesystem ext4, and mount an old ext3 as
ext4, and guess what?  You're just as forward compatible, but now you've
explicitly specified the lack of backwards compatibility.  You could even
provide a userspace tool just like in your example to switch an INCOMPAT
feature.

> People seem to be arguing that just because a distribution installer
> _could_ do a backwards incompatible upgrade without asking
> permission first, we must not provide the capability at all, and make
> it be the case that the only way to upgrade from ext3 to ext4 is with
> a backup, reformat, and restore.  Surely that doesn't make sense!

	There is no reason you need a backup/restore cycle.  Mount it as
ext4, and forever forward it's an ext4.  In the ext2->ext3 cycle, we
called it "tune2fs -j".
 
> So how about we trust the distributions to be a bit more careful
> this time around?

	Haha, you're funny.
	Seriously, Ted, I personally have one concern here.  I don't
care much about the maintainability of one code base versus two.  Both
have advantages and problems.  I care a little that my "used to be
stable" ext3 code base might be destabilized, but I know that the ext2/3
team has been better than most at stable code transitions.
	What I do care about is that "ext3" can no longer mount partition X.
That's gonna happen.  This thing still has the same name, but it is in
actuality something very different.  When ext2 could no longer mount a
journaled version of itself, we changed it to "ext3".
	Heck, forget the name, just make the breakage more explicit.  Do
it at mkfs/tunefs time.  "tunefs -extents" or "mkfs -t ext3 -extents".
A mount option assumes that you can do with or without it.  If you do it
once, you can mount the next time without it and stuff Just Works.  Even
htree follows this.  A clean unmount leaves a clean directory structure
that a non-htree driver can use.

Joel

-- 

"Not being known doesn't stop the truth from being true."
        - Richard Bach

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:20                                         ` Alex Tomas
@ 2006-06-09 21:29                                           ` Joel Becker
  2006-06-09 21:33                                             ` Alex Tomas
  2006-06-09 21:43                                             ` Joel Becker
  0 siblings, 2 replies; 296+ messages in thread
From: Joel Becker @ 2006-06-09 21:29 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox

On Sat, Jun 10, 2006 at 01:20:31AM +0400, Alex Tomas wrote:
> two points here:
> a) warnings should be made visible at mount time,
>    something like printk(KERN_CRIT ...)

	Too late, they're already broken!

> b) I don't think you're going to fight all crazy people in the world,
>    they'll definitely find a way to break something:
>    data or something else.

	Certainly not the crazy people.  But the random person who's
just humming along?  We should be nice to them.
 
> PS. in the end, "extents" option affects *new* files only. and one
>     can boot extents-enabled kernel and convert fs back.

	I just mentioned to Ted in another mail, since this is a
"permanent" change to the on-disk structure, why is this a mount option?
Shouldn't it rather be a tunefs(8)/mkfs(8) option?
	In general, anything you pass to "mount -o" is optional.  You
can mount with option X, then unmount and mount without option X.  Most
people "expect" this to work (Principle of Least Surprise).  So, when
you do:

    # mount -o extents /fs1
    # create_file /fs1/newfile
    # umount /fs1
    # mount /fs1

it breaks.  Least Surprise expects it to work.
	However, tunefs(8) and mkfs(8) are generally understood to make
physical changes.  Why not "tunefs -extents" to turn them on?  It's
completely analogous to "tune2fs -j", will fit everyone's expectation,
and won't surprise people.  "mkfs -extents" does the same thing.
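
The sequence above can be pictured with a minimal sketch. It assumes the
INCOMPAT bit is recorded in the superblock when the first extent-mapped
file is created, and that a driver refuses to mount when it sees a bit it
does not understand. All names and the flag value are hypothetical and
chosen for illustration; this is not the patch set's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, for illustration only. */
#define SKETCH_INCOMPAT_EXTENTS 0x0040u

/* Hypothetical, stripped-down superblock: only the INCOMPAT mask. */
struct sketch_sb {
    uint32_t feature_incompat;  /* INCOMPAT bits recorded on disk */
};

/* A driver may mount only if it understands every INCOMPAT bit set. */
bool sketch_can_mount(const struct sketch_sb *sb, uint32_t supported)
{
    assert(sb != NULL);
    return (sb->feature_incompat & ~supported) == 0;
}

/*
 * Creating a file while mounted with "-o extents" stores the file in
 * extent format, so the on-disk INCOMPAT bit gets set as a side effect.
 * Without the option nothing on disk changes.
 */
void sketch_create_file(struct sketch_sb *sb, bool opt_extents)
{
    assert(sb != NULL);
    if (opt_extents)
        sb->feature_incompat |= SKETCH_INCOMPAT_EXTENTS;
}
```

In this model the mount option is not the thing that persists; the first
file creation is.  A driver whose supported mask lacks the extents bit
can mount the volume before create_file and cannot afterwards, which is
the surprise being described.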

Joel

-- 

Life's Little Instruction Book #232

	"Keep your promises."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:50                       ` Dave Jones
  2006-06-09 21:09                         ` Joel Becker
@ 2006-06-09 21:32                         ` Jeff Garzik
  2006-06-09 22:56                           ` Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:32 UTC (permalink / raw)
  To: Dave Jones
  Cc: Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

Dave Jones wrote:
> Am I missing something fundamental that precludes the use of both
> extent-based and current existing filesystems from the same code
> simultaneously ?

Nothing precludes it.  The point is that introducing major format 
changes inline in this manner just complicates the code progressively to 
the point where your directory walking, inode block walking, and other 
code winds up being

	if (new)
		...
	else
		...

_anyway_, at which point it is essentially multiple independent 
filesystems.  I guarantee this won't be the last fundamental fs metadata 
design change people will want to make...
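
A minimal sketch of the kind of branch meant here, assuming a
per-inode flag selects between the old block-pointer layout and a
single-extent layout.  The structures, the flag, and the function name
are hypothetical simplifications, not the ext3 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXTENTS_FL 0x1u  /* hypothetical per-inode flag */

/* Simplified extent: a run of len blocks starting at physical 'start'. */
struct sketch_extent {
    uint32_t first_logical;
    uint16_t len;
    uint64_t start;
};

struct sketch_inode {
    uint32_t flags;
    uint64_t direct[12];          /* old format: one pointer per block */
    struct sketch_extent extent;  /* new format: one extent, for brevity */
};

/* Map a logical block to a physical block; 0 means "not mapped". */
uint64_t sketch_map_block(const struct sketch_inode *inode, uint32_t logical)
{
    assert(inode != NULL);
    if (inode->flags & SKETCH_EXTENTS_FL) {
        /* new path: range lookup inside the extent */
        const struct sketch_extent *ex = &inode->extent;
        if (logical >= ex->first_logical &&
            logical - ex->first_logical < ex->len)
            return ex->start + (logical - ex->first_logical);
        return 0;
    }
    /* old path: direct block pointers only, indirect blocks omitted */
    return logical < 12 ? inode->direct[logical] : 0;
}
```

Every walker that touches block mappings would need the same two-way
split, which is the duplication being objected to.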

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:29                                           ` Joel Becker
@ 2006-06-09 21:33                                             ` Alex Tomas
  2006-06-09 21:43                                             ` Joel Becker
  1 sibling, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-09 21:33 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox

>>>>> Joel Becker (JB) writes:

 JB> On Sat, Jun 10, 2006 at 01:20:31AM +0400, Alex Tomas wrote:
 >> two points here:
 >> a) warnings should be made visible at mount time,
 >> something like printk(KERN_CRIT ...)

 JB> 	Too late, they're already broken!

not at mount time, only upon first file creation.

thanks, Alex

PS. need to think about your tune2fs/mke2fs proposal ... thanks.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:24                         ` Joel Becker
@ 2006-06-09 21:36                           ` Chase Venters
  2006-06-09 21:51                           ` Theodore Tso
  1 sibling, 0 replies; 296+ messages in thread
From: Chase Venters @ 2006-06-09 21:36 UTC (permalink / raw)
  To: Joel Becker
  Cc: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Andreas Dilger

On Fri, 9 Jun 2006, Joel Becker wrote:

> 	Heck, forget the name, just make the breakage more explicit.  Do
> it at mkfs/tunefs time.  "tunefs -extents" or "mkfs -t ext3 -extents".
> A mount option assumes that you can do with or without it.  If you do it
> once, you can mount the next time without it and stuff Just Works.  Even
> htree follows this.  A clean unmount leaves a clean directory structure
> that a non-htree driver can use.

I suggested this somewhere back in the thread and it got no play. What's 
the problem with doing things this way? (Aside from it being a compromise 
that doesn't automatically result in a new ext4)

Of course, there are a few debates going on here. Only one of them is 
about compatibility.

>
> Joel
>

Cheers,
Chase

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:35               ` Theodore Tso
@ 2006-06-09 21:41                 ` Jeff Garzik
  2006-06-09 21:45                   ` [Ext2-devel] " Michael Poole
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:41 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
	Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel

Theodore Tso wrote:
> And I'd also dispute with your "weren't really suited for the original
> ext2-style design" comment.  Ext2/3 was always designed to be
> extensible from the start, and we've added features quite
> successfully for quite a while.

Although not the only disk format change, extents are a pretty big one. 
Will this be the last major on-disk format change?


>> Rather than taking another decade to slowly fix ext2 design decisions, 
>> why not move the process along a bit more rapidly?  Release early, 
>> release often...
> 
> I don't think it will be another decade, but yes, regardless of
> whether we do a code fork or not, it will take time.  Basically, you
> and the ext2 developers have a disagreement about whether or not a
> code fork will actually move the process along more quickly or not.
> Either way, we will be releasing early and often, so people can test
> it out and comment on it.  Releasing patches to LKML is just the first
> step in this process.

I don't see how a larger filesystem codebase could possibly move more 
quickly than a smaller codebase.  You'd have twice as many code paths to 
worry about.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:35           ` Andrew Morton
  2006-06-09 17:48             ` Jeff Garzik
@ 2006-06-09 21:42             ` Sonny Rao
  2006-06-09 22:15               ` Andrew Morton
  1 sibling, 1 reply; 296+ messages in thread
From: Sonny Rao @ 2006-06-09 21:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeff Garzik, hch, cmm, linux-kernel, ext2-devel, linux-fsdevel

On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
<snip> 
> All that being said, Linux's filesystems are looking increasingly crufty
> and we are getting to the time where we would benefit from a greenfield
> start-a-new-one.  

I'm curious about this comment; in what way are they _collectively_
looking crufty ? 



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:29                                           ` Joel Becker
  2006-06-09 21:33                                             ` Alex Tomas
@ 2006-06-09 21:43                                             ` Joel Becker
  1 sibling, 0 replies; 296+ messages in thread
From: Joel Becker @ 2006-06-09 21:43 UTC (permalink / raw)
  To: Alex Tomas, Jeff Garzik, Alan Cox, Chase Venters, Andreas Dilger,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, Jun 09, 2006 at 02:29:05PM -0700, Joel Becker wrote:
> 	However, tunefs(8) and mkfs(8) are generally understood to make
> physical changes.  Why not "tunefs -extents" to turn them on?  It's
> completely analogous to "tune2fs -j", will fit everyone's expectation,
> and won't surprise people.  "mkfs -extents" does the same thing.

	Heck, if you have code to convert extents back to regular ext3,
"tunefs -noextents" works and is properly symmetric.

Joel

-- 

"The nice thing about egotists is that they don't talk about other
 people."
         - Lucille S. Harper

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:41                 ` Jeff Garzik
@ 2006-06-09 21:45                   ` Michael Poole
  2006-06-09 21:53                     ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Michael Poole @ 2006-06-09 21:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Gerrit Huizenga, Andrew Morton, ext2-devel,
	linux-kernel, Christoph Hellwig, cmm, linux-fsdevel

Jeff Garzik writes:

> Theodore Tso wrote:
> > And I'd also dispute with your "weren't really suited for the original
> > ext2-style design" comment.  Ext2/3 was always designed to be
> > extensible from the start, and we've added features quite
> > successfully for quite a while.
> 
> Although not the only disk format change, extents are a pretty big
> one. Will this be the last major on-disk format change?

You keep making "straw that broke the camel's back" type arguments
without saying why this particular straw (rather than the other
compatibility-breaking features that are already in ext3) is the one
that must not be allowed.  Is it a matter of taste, or is there some
objective threshold that extents cross?

> >> Rather than taking another decade to slowly fix ext2 design
> >> decisions, why not move the process along a bit more rapidly?
> >> Release early, release often...
> > I don't think it will be another decade, but yes, regardless of
> > whether we do a code fork or not, it will take time.  Basically, you
> > and the ext2 developers have a disagreement about whether or not a
> > code fork will actually move the process along more quickly or not.
> > Either way, we will be releasing early and often, so people can test
> > it out and comment on it.  Releasing patches to LKML is just the first
> > step in this process.
> 
> I don't see how a larger filesystem codebase could possibly move more
> quickly than a smaller codebase.  You'd have twice as many code paths
> to worry about.

This is also the case when you cut and paste an entire filesystem's
source code, as has been mentioned several times in this thread.

Michael Poole

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:52                 ` Stephen C. Tweedie
@ 2006-06-09 21:47                   ` Jeff Garzik
  2006-06-10  0:41                     ` James Morris
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:47 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Alex Tomas, Andrew Morton, ext2-devel@lists.sourceforge.net,
	linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
	Andreas Dilger

Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:
> 
>> Think about how this will be deployed in production, long term.
>>
>> If extents are not made default at some point, then no one will use the 
>> feature, and it should not be merged.
> 
> Features such as ACLs and SELinux are still not on by default and are
> most *definitely* used.  This is a bogus argument.

They are on in SELinux-enabled distro installs, AFAIK?


>> And when extents are default, you have this blizzard-of-feature-flags 
>> stealth upgrade event occur _sometime_ after they boot into the new fs 
>> for the first time.
> 
> No.  I don't see it ever being forced on in the kernel by default, so
> there will be no such "stealth upgrades".
> 
> Rather, if it is "made default", that will be done by setting the flag
> by default on newly-created filesystems in mke2fs.  We won't be playing
> magic on existing filesystems.
> 
> And to avoid confusion, I am *entirely* open to the idea of making it
> only ever default to on in mke2fs at some point in the future where we
> batch a set of incompat features with the "ext4" label, so that "mke2fs
> -O ext4", or "mke4fs", would set it.  That has already been proposed on
> ext2-devel; we're nowhere near the stage of making that default yet.

Sure.  And why not bundle that with a vehicle for separating out the 
_code_ that deals with ancient formats versus newer formats.  A vehicle 
that enables the existing ext3 stuff to stabilize and freeze, while 
enabling parallel development of new features.

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:57                             ` Stephen C. Tweedie
@ 2006-06-09 21:49                               ` Jeff Garzik
  2006-06-09 21:55                                 ` [Ext2-devel] " Stephen C. Tweedie
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:49 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2006-06-09 at 16:04 -0400, Jeff Garzik wrote:
> 
>> Consider a blkdev of size S1.  Using LVM we increase that value under 
>> the hood to size S2, where S2 > S1.  We perform an online resize from 
>> size S1 to S2.  The size and alignment of any new groups added will 
>> differ from the non-resize case, where mke2fs was run directly on a 
>> blkdev of size S2.
> 
> No, they won't.  We simply grow the last block group in the filesystem
> up to the size where we'd naturally add another block group anyway; and
> then, we add another block group exactly where it would have been on a
> fresh mkfs.

Yes but the inodes per group etc. would differ.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:24                         ` Joel Becker
  2006-06-09 21:36                           ` [Ext2-devel] " Chase Venters
@ 2006-06-09 21:51                           ` Theodore Tso
  2006-06-09 22:07                             ` Joel Becker
  1 sibling, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 21:51 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 02:24:10PM -0700, Joel Becker wrote:
> 	Heck, forget the name, just make the breakage more explicit.  Do
> it at mkfs/tunefs time.  "tunefs -extents" or "mkfs -t ext3 -extents".
> A mount option assumes that you can do with or without it.  If you do it
> once, you can mount the next time without it and stuff Just Works.  Even
> htree follows this.  A clean unmount leaves a clean directory structure
> that a non-htree driver can use.

Agreed; I was never a fan of how we enabled extended attributes
using a mount option, as it clutters the /etc/fstab line, among other
things.  (I added the tune2fs -o feature so that default mount options
could be stored in the superblock to try to cover that design botch,
but the real answer is that extended attributes should never have been
done via a mount option, or at least not as the only thing
you had to do before the feature became enabled.)

The right approach is what we did with journaling (tune2fs -j or
tune2fs -O has_journal) and what we did with htree support (tune2fs -O
dir_index), to explicitly enable that feature, and that's certainly
what I will be pushing for.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:09                         ` Joel Becker
@ 2006-06-09 21:51                           ` Mike Snitzer
  0 siblings, 0 replies; 296+ messages in thread
From: Mike Snitzer @ 2006-06-09 21:51 UTC (permalink / raw)
  To: Dave Jones, Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
	Andreas Dilger

On 6/9/06, Joel Becker <Joel.Becker@oracle.com> wrote:
> On Fri, Jun 09, 2006 at 04:50:36PM -0400, Dave Jones wrote:
> > On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> >  > that the older code cannot read.  Alex claims people just shouldn't use
> >  > "-o extents", but the fact is their distro will choose it for them.
> >
> > .. on partitions over a certain size, which couldn't be read with
> > older ext3 filesystems _anyway_
>
>         Certainly that would be fine.  Is that what will actually
> happen?  Experience says no.  Even if you get it right in your distro,
> not all distros will.  Heck, can you promise me that your distro will
> provide e2fsprogs updates to its older releases so that multiboot will
> continue to work?

If the kernel were bound by all the stakeholders' ability to _always_
"do the right thing" very little innovation would be possible.  These
tenuous arguments of hypothetical (ab)users are tiresome.

If the distro vendor did default to ext3+extents and it screwed your
hypothetical extents-naive user (booting a non-vendor kernel isn't
something your mom is going to do) then they strayed too far from
their Linux comfort-zone. If worst came to worst _THE UPDATED
EXT3UTILS WOULD PREVENT MOUNTING AN EXT3 FS WITH AN INCOMPATIBLE
FEATURE_.  God forbid the naive-user get an error when they try
something they shouldn't.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:45                   ` [Ext2-devel] " Michael Poole
@ 2006-06-09 21:53                     ` Jeff Garzik
  2006-06-09 22:04                       ` Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:53 UTC (permalink / raw)
  To: Michael Poole
  Cc: Theodore Tso, Gerrit Huizenga, Andrew Morton, ext2-devel,
	linux-kernel, Christoph Hellwig, cmm, linux-fsdevel

Michael Poole wrote:
> Jeff Garzik writes:
> 
>> Theodore Tso wrote:
>>> And I'd also dispute with your "weren't really suited for the original
>>> ext2-style design" comment.  Ext2/3 was always designed to be
>>> extensible from the start, and we've added features quite
>>> successfully for quite a while.
>> Although not the only disk format change, extents are a pretty big
>> one. Will this be the last major on-disk format change?
> 
> You keep making "straw that broke the camel's back" type arguments
> without saying why this particular straw (rather than the other
> compatibility-breaking features that are already in ext3) is the one
> that must not be allowed.  Is it a matter of taste, or is there some
> objective threshold that extents cross?

Yes, it's not a small change to the on-disk format.

If you write tools that read an ext3 filesystem, you won't be able to 
read file data at all, without updating your code.

That's a much bigger deal than say 32-bit uids.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:49                               ` Jeff Garzik
@ 2006-06-09 21:55                                 ` Stephen C. Tweedie
  2006-06-09 23:44                                   ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 21:55 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas, Stephen Tweedie

Hi,

On Fri, 2006-06-09 at 17:49 -0400, Jeff Garzik wrote:

> >> Consider a blkdev of size S1.  Using LVM we increase that value under 
> >> the hood to size S2, where S2 > S1.  We perform an online resize from 
> >> size S1 to S2.  The size and alignment of any new groups added will 
> >> differ from the non-resize case, where mke2fs was run directly on a 
> >> blkdev of size S2.
> > 
> > No, they won't.  We simply grow the last block group in the filesystem
> > up to the size where we'd naturally add another block group anyway; and
> > then, we add another block group exactly where it would have been on a
> > fresh mkfs.
> 
> Yes but the inodes per group etc. would differ.

No, we add the same number of inodes in the new groups that all the
previous groups have.
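
The arithmetic behind this can be sketched as follows, under simplifying
assumptions: a fixed blocks-per-group and inodes-per-group, groups laid
out back to back from block 0, and no special handling of a very small
trailing group.  The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of block groups needed to cover 'blocks' (last may be partial). */
uint64_t sketch_group_count(uint64_t blocks, uint32_t blocks_per_group)
{
    assert(blocks_per_group != 0);
    return blocks / blocks_per_group + (blocks % blocks_per_group != 0);
}

/* First block of a group; depends only on the group index. */
uint64_t sketch_group_start(uint64_t group, uint32_t blocks_per_group)
{
    return group * blocks_per_group;
}

/* Every group, old or newly added, carries the same number of inodes. */
uint64_t sketch_total_inodes(uint64_t groups, uint32_t inodes_per_group)
{
    return groups * inodes_per_group;
}
```

Because a group's start depends only on its index and the per-group
constants, growing from S1 to S2 yields the same group boundaries and
the same inode total as computing them for S2 directly.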

--Stephen



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:53                     ` Jeff Garzik
@ 2006-06-09 22:04                       ` Theodore Tso
  0 siblings, 0 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 22:04 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
	Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel

On Fri, Jun 09, 2006 at 05:53:14PM -0400, Jeff Garzik wrote:
> Yes, it's not a small change to the on-disk format.
> 
> If you write tools that read an ext3 filesystem, you won't be able to 
> read file data at all, without updating your code.

Most tools that read an ext2/3 filesystem directly use the libext2fs
library, and it will definitely be the case that for files smaller
than 4TB, even on a filesystem with extents enabled, as long as you
are using a version of libext2fs which is extents-aware, it will work
without any changes.

For files larger than 4TB, we will need some kind of LFS-like
interface change (i.e., ext2fs_file_llseek64 vs. ext2fs_file_llseek),
but that should be the only change needed by the tool.

					- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:51                           ` Theodore Tso
@ 2006-06-09 22:07                             ` Joel Becker
  2006-06-09 22:31                               ` [Ext2-devel] " Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Joel Becker @ 2006-06-09 22:07 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 05:51:37PM -0400, Theodore Tso wrote:
> The right approach is what we did with journaling (tune2fs -j or
> tune2fs -O has_journal) and what we did with htree support (tune2fs -O
> dir_index), to explicitly enable that feature, and that's certainly
> what I will be pushing for.

	Excellent.  And now let's close the other side of compatibility.
The attribute problem we discussed with e2fsck has a simple solution:
exit cleanly when you don't understand a filesystem.
	If e2fsck finds an INCOMPAT flag it doesn't understand, it
didn't *fail* to fsck, it just plain doesn't understand the filesystem.
This should not, in any way, prevent bootup from continuing.  Later,
mount may succeed (if the kernel is new enough) or fail (if not), but my
system won't be completely unusable by surprise (assuming that / isn't
the affected filesystem).

Joel

-- 

Bram's Law:
	The easier a piece of software is to write, the worse it's
	implemented in practice.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:42             ` Sonny Rao
@ 2006-06-09 22:15               ` Andrew Morton
  2006-06-09 23:11                 ` Andreas Dilger
  2006-06-10  3:49                 ` Nathan Scott
  0 siblings, 2 replies; 296+ messages in thread
From: Andrew Morton @ 2006-06-09 22:15 UTC (permalink / raw)
  To: Sonny Rao; +Cc: jeff, hch, cmm, linux-kernel, ext2-devel, linux-fsdevel

Sonny Rao <sonny@burdell.org> wrote:
>
> On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
> <snip> 
> > All that being said, Linux's filesystems are looking increasingly crufty
> > and we are getting to the time where we would benefit from a greenfield
> > start-a-new-one.  
> 
> I'm curious about this comment; in what way are they _collectively_
> looking crufty ? 

We seem to be lagging behind "the industry" in some areas - handling large
devices, high bandwidth IO, sophisticated on-disk data structures, advanced
manageability, etc.

I mean, although ZFS is a rampant layering violation and we can do a lot of
the things in there (without doing it all in the fs!) I don't think we can
do all of it.

We're continuing to nurse along a few basically-15-year-old filesystems
while we do have the brains, manpower and processes to implement a new,
really great one.

It's just this feeling I have ;)

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:07                             ` Joel Becker
@ 2006-06-09 22:31                               ` Theodore Tso
  2006-06-09 22:47                                 ` Joel Becker
  0 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 22:31 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 03:07:11PM -0700, Joel Becker wrote:
> 	Excellent.  And now let's close the other side of compatibility.
> The attribute problem we discussed with e2fsck has a simple solution:
> exit cleanly when you don't understand a filesystem.
> 	If e2fsck finds an INCOMPAT flag it doesn't understand, it
> didn't *fail* to fsck, it just plain doesn't understand the filesystem.
> This should not, in any way, prevent bootup from continuing.  Later,
> mount may succeed (if the kernel is new enough) or fail (if not), but my
> system won't be completely unusable by surprise (assuming that / isn't
> the affected filesystem).

The potential problem with this is that the system administrator may never
realize that the filesystem is just getting silently skipped.  (And a
big fat warning printed by e2fsck doesn't help when distros like
Ubuntu use a graphical boot sequence that hides warning messages
printed by e2fsck).

Is it really that hard to edit /etc/fstab so that the fsck pass is
skipped?  

I might be willing to make it be a configurable option in
/etc/e2fsck.conf, but it *is* dangerous to have e2fsck exit with
success without having actually checked the filesystem.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:04                           ` Jeff Garzik
  2006-06-09 20:57                             ` Stephen C. Tweedie
@ 2006-06-09 22:37                             ` Andreas Dilger
  1 sibling, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 22:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Matthew Frost, Alex Tomas, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On Jun 09, 2006  16:04 -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
> > And what ugly hacks are you talking about?  It's actually quite clean;
> > with the latest e2fsprogs, you use the same command (resize2fs) for
> > doing both online and offline resizing.
> 
> Consider a blkdev of size S1.  Using LVM we increase that value under 
> the hood to size S2, where S2 > S1.  We perform an online resize from 
> size S1 to S2.  The size and alignment of any new groups added will 
> differ from the non-resize case, where mke2fs was run directly on a 
> blkdev of size S2.

Umm, and how is that a problem?  Either you want online resizing because
it provides some useful functionality, or you don't want it because you
are concerned with something that nobody else in the world is.  In the
latter case, don't use it.  Even if the metadata alignment is slightly
different on disk, that doesn't make it in any way an invalid filesystem.  In
fact, online resizing is 100% compatible after the resize back to the
dark ages of linux.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:31                               ` [Ext2-devel] " Theodore Tso
@ 2006-06-09 22:47                                 ` Joel Becker
  2006-06-09 23:54                                   ` [Ext2-devel] " Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Joel Becker @ 2006-06-09 22:47 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 06:31:29PM -0400, Theodore Tso wrote:
> The potential problem with this is that the system administrator may never
> realize that the filesystem is just getting silently skipped.  (And a
> big fat warning printed by e2fsck doesn't help when distros like
> Ubuntu use a graphical boot sequence that hides warning messages
> printed by e2fsck).

	Yeah, you're not the only one to point this out.

> Is it really that hard to edit /etc/fstab so that the fsck pass is
> skipped?  

	Depends on how close you are in proximity to the console, I
suspect.  Point is, something _broke_.
	Regardless of the name, clearly we have a _different_
filesystem.  With a clean unmount, a journaled ext3 is still a valid
ext2.  With a clean unmount, an extended-attribute ext3 is still a valid
plain ext3 and a valid ext2.  With a clean unmount, a dir_index ext3 is
still a valid plain ext3 and a valid ext2.  An extents ext3 is NEVER a
valid plain ext3 or ext2.
	What happens today if you have a filesystem in fstab that
has no fsck in /sbin (eg, we all pick the name 'ext4', it says 'ext4' in
fstab, but there is no /sbin/fsck.ext4)?  Does "fsck -a" skip the
partition, or halt and fail the boot?  If the latter, I suspect that the
only solution is "I hope you don't encounter this on remote machines ha
ha ha".  If it skips, we have yet another reason that using the same
name is a bad thing.

Joel

-- 

"Sometimes when reading Goethe I have the paralyzing suspicion
 that he is trying to be funny."
         - Guy Davenport

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:22                     ` Alex Tomas
  2006-06-09 19:23                       ` Jeff Garzik
@ 2006-06-09 22:49                       ` Valdis.Kletnieks
  2006-06-09 23:34                         ` [Ext2-devel] " Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09 22:49 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, hch, cmm,
	linux-fsdevel


On Fri, 09 Jun 2006 23:22:23 +0400, Alex Tomas said:
> what if proposed patch is safer than an average fix?
> (given that it's just out of usage unless enabled)

Those are the *dangerous* patches, because they usually contain bugs
that weren't tripped over by the 6 people who enabled it while it
was bouncing around in the -mm tree....

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:33                 ` Alex Tomas
  2006-06-09 16:37                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 22:52                   ` Valdis.Kletnieks
  2006-06-09 23:21                     ` Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09 22:52 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Mike Snitzer, Jeff Garzik, Christoph Hellwig, Mingming Cao,
	linux-kernel, ext2-devel, linux-fsdevel

On Fri, 09 Jun 2006 20:33:18 +0400, Alex Tomas said:

> one who needs/wants to go back may get rid of extents by:
> a) remounting w/o extents option
> b) copying new-fashion-style files so that copies use blockmap
> c) dropping extents feature in superblock

OK.. Obviously my brain is tiny and easily overfilled.

Given that the whole alleged problem with extents is that they're not
backward compatible, how do you read the files in (b) so that you can copy
them, if the data is in the non-compatible extents that you can't read because
you've disabled extents?

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:32                         ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 22:56                           ` Andreas Dilger
  2006-06-09 23:06                             ` Linus Torvalds
  2006-06-09 23:09                             ` Jeff Garzik
  0 siblings, 2 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 22:56 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel

On Jun 09, 2006  17:32 -0400, Jeff Garzik wrote:
> Dave Jones wrote:
> >Am I missing something fundamental that precludes the use of both
> >extent-based and current existing filesystems from the same code
> >simultaneously ?
> 
> Nothing precludes it.  The point is that introducing major format 
> changes inline in this manner just complicates the code progressively to 
> the point where your directory walking, inode block walking, and other 
> code winds up being
> 
> 	if (new)
> 		...
> 	else
> 		...
> 
> _anyway_, at which point it is essentially multiple independent 
> filesystems.  I guarantee this won't be the last fundamental fs metadata 
> design change people will want to make...

Umm, and how is this fundamentally different from similar code paths in
the VFS (e.g. O_DIRECT vs regular writes)?  Should we make a copy of the
whole write path for each of O_DIRECT, AIO, pwrite, etc writes, or should
we instead add in a small change to the write path that leverages the
majority of the existing code?

What is better, using the 95% of the VFS that is common and change 5% to
work with the filesystem, or duplicate the whole VFS just because 5%
needs to be different?

In the extents case, the large majority of the ext3 code is the same
(directory, inode handling, superblock, etc) and only the on-disk format
for indirect blocks has changed.  Yes, we also want to change the block
allocator next in order to improve the performance in conjunction with
extents, but that is purely an in-memory change that has no direct
relation to on-disk layout.  The major motivations for the extents format:
(a) more compact on-disk representation for large files (improves unlink
    performance, reduces memory usage for metadata)
(b) support for larger filesystems (which will affect everyone soon enough).
(c) integrate well with improved allocation support
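
A minimal sketch of point (a), not taken from the patches: it counts the
mapping metadata needed for one fully contiguous file under the classic
block map versus extents.  The 4 KiB block size, the 12 direct pointers
and the 32-bit block pointers are the usual ext2/ext3 values; the
per-extent limit of 65535 blocks is only an assumption derived from the
16-bit ee_len field in the structure posted at the start of this thread
(a real implementation may cap it lower).  The function names are
hypothetical.

```python
import math

BLOCK_SIZE = 4096                  # bytes per filesystem block (assumed)
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 32-bit block pointers per indirect block
DIRECT_BLOCKS = 12                 # direct pointers stored in the inode
MAX_EXTENT_LEN = 2**16 - 1         # assumed cap, from the 16-bit length field


def indirect_blocks_needed(nblocks):
    """Indirect blocks needed to map nblocks data blocks with a block map."""
    remaining = max(0, nblocks - DIRECT_BLOCKS)
    if remaining == 0:
        return 0
    total = 1                                      # the single-indirect block
    remaining = max(0, remaining - PTRS_PER_BLOCK)
    if remaining == 0:
        return total
    dbl = min(remaining, PTRS_PER_BLOCK ** 2)      # blocks mapped via double-indirect
    total += 1 + math.ceil(dbl / PTRS_PER_BLOCK)   # top block plus its leaves
    remaining -= dbl
    if remaining == 0:
        return total
    leaves = math.ceil(remaining / PTRS_PER_BLOCK)  # triple-indirect leaf blocks
    total += 1 + math.ceil(leaves / PTRS_PER_BLOCK) + leaves
    return total


def extents_needed(nblocks):
    """Extent records needed if the file is one contiguous run on disk."""
    return math.ceil(nblocks / MAX_EXTENT_LEN)


one_gib = (1 << 30) // BLOCK_SIZE   # 262144 blocks
print(indirect_blocks_needed(one_gib), "indirect blocks vs",
      extents_needed(one_gib), "extents")
```

Under these assumptions a contiguous 1 GiB file needs 257 indirect blocks
(about 1 MiB of metadata to read, and to free on unlink) against 5 extent
records of 12 bytes each.  A fragmented file would need more extents, so
this is a best case rather than a general ratio.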

For most of the ext3 developers (b) is the primary motivation here, and
given that so many people are vocal about ext3 changes that must mean
that there are a lot of ext3 users here.  Does that mean that in the next
few years all of the objectors will abandon ext3 in favour of ext4 or XFS
or JFS or reiserfs or reiser4 when you get a new system with a single 12TB
disk?  And we can delete ext3 then, or will you be happy then that ext3
supports these large disks without any effort on your part?

Maybe we should start by deleting ext2 because it is old and obsolete?
The reality is that we will never merge the forks back once they are made.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:56                           ` Andreas Dilger
@ 2006-06-09 23:06                             ` Linus Torvalds
  2006-06-09 23:09                             ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 23:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Dave Jones, Alex Tomas

On Fri, 9 Jun 2006, Andreas Dilger wrote:
> 
> Umm, and how is this fundamentally different from similar code paths in
> the VFS (e.g. O_DIRECT vs regular writes)?

That's a great argument.

You're arguing that your thing is good by pointing to a known disaster 
area and saying "but that other thing does it too, so it must be good".

O_DIRECT is a piece of crap, and I'm still sorry that I didn't push back 
enough on it. And I _did_ push back on it. But I finally was worn down.

And yes, part of the problem is exactly that it uses _almost_ the same 
paths, but not quite. 

Oh, well.

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:56                           ` Andreas Dilger
  2006-06-09 23:06                             ` Linus Torvalds
@ 2006-06-09 23:09                             ` Jeff Garzik
  2006-06-09 23:37                               ` [Ext2-devel] " Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:09 UTC (permalink / raw)
  To: Jeff Garzik, Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel

Andreas Dilger wrote:
> Maybe we should start by deleting ext2 because it is old and obsolete?
> The reality is that we will never merge the forks back once they are made.

We _already have_ a relevant example:  ext2 -> ext3.

A useful fork is in the tree, and you're working on it.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:15               ` Andrew Morton
@ 2006-06-09 23:11                 ` Andreas Dilger
  2006-06-09 23:15                   ` Jeff Garzik
  2006-06-10  3:37                   ` Valerie Henson
  2006-06-10  3:49                 ` Nathan Scott
  1 sibling, 2 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jeff, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel,
	Sonny Rao

On Jun 09, 2006  15:15 -0700, Andrew Morton wrote:
> We seem to be lagging behind "the industry" in some areas - handling large
> devices, high bandwidth IO, sophisticated on-disk data structures, advanced
> manageability, etc.
> 
> I mean, although ZFS is a rampant layering violation and we can do a lot of
> the things in there (without doing it all in the fs!) I don't think we can
> do all of it.
> 
> We're continuing to nurse along a few basically-15-year-old filesystems
> while we do have the brains, manpower and processes to implement a new,
> really great one.
> 
> It's just this feeling I have ;)

I think many people share this feeling (me included), hence the linux
filesystem meeting next week...  The problem is that even getting a
half-decent disk filesystem is many years of work, and large disks are
here before then.  The ZFS code took 10 years to get to its current state,
I understand, so I don't anticipate we will get there overnight.

The question is whether we can get to this state more easily by starting
on a known-good base (ext3) or by starting from scratch.  My opinion is
strongly in the "start from a known-good base" camp, and make incremental
improvements to that base instead of discarding everything and starting
again.

I think the real frontier for future filesystem development is in the
ZFS direction where the filesystem can be robust in the face of data
errors without having a single fail-stop mode of error handling.  While
ext2 and ext3 have been OK in this regard they can definitely be improved
without discarding the rest of the code and the millions of hours of
testing that has gone into it.

I'm not so strongly against ext4 that I won't follow that route if needed,
but it essentially means that ext3 will be orphaned.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:11                 ` Andreas Dilger
@ 2006-06-09 23:15                   ` Jeff Garzik
  2006-06-10  3:37                   ` Valerie Henson
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:15 UTC (permalink / raw)
  To: Andrew Morton, Sonny Rao, jeff, hch, cmm, linux-kernel,
	ext2-devel, linux-fsdevel

Andreas Dilger wrote:
> I'm not so strongly against ext4 that I won't follow that route if needed,
> but it essentially means that ext3 will be orphaned.

Not orphaned but scaled back over time.  IMO there's only so much 
developer and brain and test bandwidth for "the main Linux filesystem."

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:52                   ` Valdis.Kletnieks
@ 2006-06-09 23:21                     ` Andreas Dilger
  2006-06-10  1:21                       ` Valdis.Kletnieks
  0 siblings, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:21 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Alex Tomas, Jeff Garzik, ext2-devel, linux-kernel,
	Christoph Hellwig, Mingming Cao, linux-fsdevel

On Jun 09, 2006  18:52 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 20:33:18 +0400, Alex Tomas said:
> > one who needs/wants to go back may get rid of extents by:
> > a) remounting w/o extents option
> > b) copying new-fashion-style files so that copies use blockmap
> > c) dropping extents feature in superblock
> 
> OK.. Obviously my brain is tiny and easily overfilled.

...

> Given that the whole alleged problem with extents is that they're not
> backward compatible, how do you read the files in (b) so that you can copy
> them, if the data is in the non-compatible extents that you can't read because
> you've disabled extents?

You mount with the new kernel without "-o extents", and find files with
extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
files, mv over old files, unmount.

A similar thing is necessary for ext3 filesystems before you can mount them
as ext2 - they can't be mounted as ext2 until the journal is recovered
(an unrecovered journal is an incompatible feature).
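
A minimal sketch of the feature-flag rule being relied on here, assuming
the usual ext2/ext3 semantics: a driver must refuse a filesystem with any
INCOMPAT bit it does not know, may mount read-only if only an RO_COMPAT
bit is unknown, and may ignore unknown COMPAT bits.  The bit values are
the ones from the ext2/ext3 headers; the value for extents is an
assumption (the one later used by ext4), and mount_decision() is a
hypothetical helper, not a kernel function.

```python
# COMPAT bits: safe for an older driver to ignore.
COMPAT_HAS_JOURNAL = 0x0004
COMPAT_EXT_ATTR = 0x0008
COMPAT_DIR_INDEX = 0x0020

# INCOMPAT bits: an older driver must refuse the filesystem.
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004    # journal needs recovery
INCOMPAT_EXTENTS = 0x0040    # assumed value for the proposed extents feature


def mount_decision(sb_incompat, sb_ro_compat, known_incompat, known_ro_compat):
    """Decide how a driver may mount, given the superblock feature words."""
    if sb_incompat & ~known_incompat:
        return "refuse"        # unknown incompatible feature present
    if sb_ro_compat & ~known_ro_compat:
        return "read-only"     # safe to read, not safe to modify
    return "read-write"


# An ext2-only driver (knows FILETYPE) looking at three filesystems:
ext2_known = INCOMPAT_FILETYPE
print(mount_decision(INCOMPAT_FILETYPE, 0, ext2_known, 0))                      # clean ext3
print(mount_decision(INCOMPAT_FILETYPE | INCOMPAT_RECOVER, 0, ext2_known, 0))   # dirty journal
print(mount_decision(INCOMPAT_FILETYPE | INCOMPAT_EXTENTS, 0, ext2_known, 0))   # extents in use
```

The second and third cases are both refused for the same reason.  The
difference discussed in this thread is that replaying the journal clears
the recovery bit, while the extents bit stays set until the files have
been copied back to block maps.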

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:49                       ` Valdis.Kletnieks
@ 2006-06-09 23:34                         ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:34 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Alex Tomas, Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	hch, cmm, linux-fsdevel

On Jun 09, 2006  18:49 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 23:22:23 +0400, Alex Tomas said:
> > what if proposed patch is safer than an average fix?
> > (given that it's just out of usage unless enabled)
> 
> Those are the *dangerous* patches, because they usually contain bugs
> that weren't tripped over by the 6 people who enabled it while it
> was bouncing around in the -mm tree....

Umm, in case you didn't know, the extent patch which is the primary issue
of discussion here (not the whole 64-bit clean changes though) was run
for MILLIONS of hours under very high IO load on the largest computer
systems in the world for the last year or so.  It is easy to get millions
of hours of usage if there are thousands of servers running this code...

Yes, I have no doubt there will be bugs in the code because the usage
pattern is different for different environments, but we aren't advocating
the inclusion of something major like this that was just written
yesterday in someone's basement.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:09                             ` Jeff Garzik
@ 2006-06-09 23:37                               ` Andreas Dilger
  2006-06-09 23:54                                 ` Linus Torvalds
  0 siblings, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel

On Jun 09, 2006  19:09 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Maybe we should start by deleting ext2 because it is old and obsolete?
> >The reality is that we will never merge the forks back once they are made.
> 
> We _already have_ a relevant example:  ext2 -> ext3.
> 
> A useful fork is in the tree, and you're working on it.

OK, you're right.  We'll continue working on the fork (namely ext3) and
when people who care consider those features stable enough they can port
them to ext2. :-)

Like another person pointed out - there are bugs that are fixed in ext3
that aren't fixed in ext2, and vice versa.  Even though the ext2 code
is basically dead, new bugs are still found in it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:55                                 ` [Ext2-devel] " Stephen C. Tweedie
@ 2006-06-09 23:44                                   ` Jeff Garzik
  2006-06-10  0:45                                     ` [Ext2-devel] " Andreas Dilger
  2006-06-10  0:47                                     ` Theodore Tso
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:44 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2006-06-09 at 17:49 -0400, Jeff Garzik wrote:
> 
>>>> Consider a blkdev of size S1.  Using LVM we increase that value under 
>>>> the hood to size S2, where S2 > S1.  We perform an online resize from 
>>>> size S1 to S2.  The size and alignment of any new groups added will 
>>>> differ from the non-resize case, where mke2fs was run directly on a 
>>>> blkdev of size S2.
>>> No, they won't.  We simply grow the last block group in the filesystem
>>> up to the size where we'd naturally add another block group anyway; and
>>> then, we add another block group exactly where it would have been on a
>>> fresh mkfs.
>> Yes but the inodes per group etc. would differ.
> 
> No, we add the same number of inodes in the new groups that all the
> previous groups have.

Yes.  Re-read what I wrote.  To put it another way, "mkfs S1 + resize to 
S2" does not produce precisely the same layout as "mkfs S2".
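
A minimal sketch of the resize behaviour described in the quoted text
(grow the last block group to full size, then add groups at their natural
boundaries), assuming 4 KiB blocks and therefore 32768 blocks per group.
It models only the group boundaries, not where the bitmaps and inode
table sit inside each group, and it ignores mke2fs details such as
dropping a very small last group.  group_ranges() and resize() are
hypothetical names.

```python
BLOCKS_PER_GROUP = 32768   # 8 * 4096: one bitmap block covers this many blocks


def group_ranges(total_blocks, first_data_block=0):
    """Block ranges (start, end), end exclusive, as a fresh mkfs would lay them out."""
    ranges = []
    start = first_data_block
    while start < total_blocks:
        end = min(start + BLOCKS_PER_GROUP, total_blocks)
        ranges.append((start, end))
        start = end
    return ranges


def resize(ranges, new_total_blocks):
    """Grow the last group to full size, then append groups at natural boundaries."""
    grown = list(ranges[:-1])
    start = ranges[-1][0]          # the last, possibly partial, group
    while start < new_total_blocks:
        end = min(start + BLOCKS_PER_GROUP, new_total_blocks)
        grown.append((start, end))
        start = end
    return grown


s1, s2 = 100_000, 250_000
print(resize(group_ranges(s1), s2) == group_ranges(s2))
```

In this model the group boundaries after "mkfs S1 + resize to S2" match
those of "mkfs S2".  Any remaining layout difference would have to be in
the metadata placement within the groups, which the model leaves out.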

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:03                       ` Theodore Tso
  2006-06-09 21:24                         ` Joel Becker
@ 2006-06-09 23:48                         ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:48 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

Theodore Tso wrote:
> I suspect that Red Hat has learned from that past experience, and
> won't be making that mistake again, at least without explicitly
> requesting the user's permission.  So how about we trust the
> distributions to be a bit more careful this time around?

Make the line of demarcation much more clear...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:37                               ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 23:54                                 ` Linus Torvalds
  0 siblings, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-09 23:54 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
	linux-kernel, cmm, linux-fsdevel, Dave Jones, Alex Tomas

On Fri, 9 Jun 2006, Andreas Dilger wrote:
>
> OK, you're right.  We'll continue working on the fork (namely ext3) and
> when people who care consider those features stable enough they can port
> them to ext2. :-)

You're totally inappropriately focused on this whole "porting back" side.

THE WHOLE POINT IS TO NOT PORT THINGS BACK.

There is absolutely no point in any ext4 work being ported back to ext3, 
since the whole point of a fork like this is to have the "stable" thing.

Yes, old bugs happen and sometimes exist in both, but hey, the number of 
duplicated bugs - while not non-zero - is still less than the bugs 
introduced by trying to keep things in sync.

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:47                                 ` Joel Becker
@ 2006-06-09 23:54                                   ` Theodore Tso
  0 siblings, 0 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-09 23:54 UTC (permalink / raw)
  To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 03:47:00PM -0700, Joel Becker wrote:
> 	What happens today if you have a filesystem in fstab that
> has no fsck in /sbin (eg, we all pick the name 'ext4', it says 'ext4' in
> fstab, but there is no /sbin/fsck.ext4)?  Does "fsck -a" skip the
> partition, or halt and fail the boot?  If the latter, I suspect that the
> only solution is "I hope you don't encounter this on remote machines ha
> ha ha".  

It will halt and fail the boot.

Of course, installing a kernel more recent than 2.6.14 or so on a RHEL4
system when you have a SCSI controller such as MPT Fusion will also
cause the system to fail to boot unless you remember to compile it
directly into the kernel because of changes in semantics about whether
the SCSI probing happens before or after the module load completes ---
and the answer that has been given is "we don't care".  So these sorts
of traps have been around for people who are going back and forth
between the bleeding edge and distro systems, but I think we'd all
agree that this isn't necessarily the common case.

							- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 16:55                 ` Jeff Garzik
  2006-06-09 17:12                   ` [Ext2-devel] " Alex Tomas
  2006-06-09 19:57                   ` Theodore Tso
@ 2006-06-10  0:07                   ` Olivier Galibert
  2006-06-10  0:13                     ` Jeff Garzik
  2 siblings, 1 reply; 296+ messages in thread
From: Olivier Galibert @ 2006-06-10  0:07 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
> Alex Tomas wrote:
> >so, instead of taking one (quite-well-tested) part that solves one of
> >the biggest ext3 limitations, you propose to start a new project and
> >get something in a year (probably) ?
> >
> >I think about extents as a step-by-step way ...
> 
> That is what the entirety of Linux development is -- step-by-step.
> 
> It is OBVIOUS that it would take five minutes to start ext4.
> 
> 1) clone a new tree
> 2) cp -a fs/ext3 fs/ext4
> 3) apply extent and 48bit patches
> 4) apply related e2fsprogs patches

5) force all options (attributes, etc) on and remove the flags
   indicating their existence from the metadata, you'll need the space
   for the fs evolution.

6) change the fs just enough so that an ext4 fs can never be mounted
   as ext3 or 2.

  OG.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  0:07                   ` Olivier Galibert
@ 2006-06-10  0:13                     ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  0:13 UTC (permalink / raw)
  To: Olivier Galibert, Jeff Garzik, Alex Tomas, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
	Andreas Dilger

Olivier Galibert wrote:
> On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
>> Alex Tomas wrote:
>>> so, instead of taking one (quite-well-tested) part that solves one of
>>> the biggest ext3 limitations, you propose to start a new project and
>>> get something in a year (probably) ?
>>>
>>> I think about extents as a step-by-step way ...
>> That is what the entirety of Linux development is -- step-by-step.
>>
>> It is OBVIOUS that it would take five minutes to start ext4.
>>
>> 1) clone a new tree
>> 2) cp -a fs/ext3 fs/ext4
>> 3) apply extent and 48bit patches
>> 4) apply related e2fsprogs patches
> 
> 5) force all options (attributes, etc) on and remove the flags
>    indicating their existence from the metadata, you'll need the space
>    for the fs evolution.
> 
> 6) change the fs just enough so that an ext4 fs can never be mounted
>    as ext3 or 2.

Yeah.  And it's easy enough just to change the main magic number...

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 21:47                   ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10  0:41                     ` James Morris
  0 siblings, 0 replies; 296+ messages in thread
From: James Morris @ 2006-06-10  0:41 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, Alex Tomas, Andreas Dilger

On Fri, 9 Jun 2006, Jeff Garzik wrote:

> Stephen C. Tweedie wrote:
> > Hi,
> > 
> > On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:
> > 
> > > Think about how this will be deployed in production, long term.
> > > 
> > > If extents are not made default at some point, then no one will use the
> > > feature, and it should not be merged.
> > 
> > Features such as ACLs and SELinux are still not on by default and are
> > most *definitely* used.  This is a bogus argument.
> 
> They are on in SELinux-enabled distro installs, AFAIK?

In RHEL & FC, SELinux xattrs are enabled by default, and acls need to be 
enabled via a mount option.


-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:44                                   ` Jeff Garzik
@ 2006-06-10  0:45                                     ` Andreas Dilger
  2006-06-10  0:47                                     ` Theodore Tso
  1 sibling, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  0:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Stephen C. Tweedie, Andrew Morton, Theodore Ts'o,
	Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
	Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas

On Jun 09, 2006  19:44 -0400, Jeff Garzik wrote:
> Stephen C. Tweedie wrote:
> > No, we add the same number of inodes in the new groups that all the
> > previous groups have.
> 
> Yes.  Re-read what I wrote.  To put it another way, "mkfs S1 + resize to 
> S2" does not produce precisely the same layout as "mkfs S2".

And in what way is that important?  I mean, really, if this is your argument
that ext3 online resizing is a "hack" then it is pretty weak.  This does
not affect the operation or compatibility of the resized filesystem all the
way back to the stone age (i.e. every single ext2 kernel ever will work
with the resized filesystem).  That is why online resizing (and the resize
inode) are a COMPAT feature.

If I "cp b a /mnt/newfs" and "cp a b /mnt/newfs", "a" and "b" will have
different inode numbers too, but that doesn't mean that "cp" is a "hack".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:44                                   ` Jeff Garzik
  2006-06-10  0:45                                     ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10  0:47                                     ` Theodore Tso
  2006-06-10  1:09                                       ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-10  0:47 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Fri, Jun 09, 2006 at 07:44:44PM -0400, Jeff Garzik wrote:
> Yes.  Re-read what I wrote.  To put it another way, "mkfs S1 + resize to 
> S2" does not produce precisely the same layout as "mkfs S2".

Different in the same way that "mke2fs -E stride=5" results in a slightly
different location of where the block bitmaps, inode bitmaps, and
inode table might be, yes --- but SO WHAT?  

There's a *reason* that the block group descriptors tell the kernel
where to find the block/inode bitmaps and the inode table.  They can
change due to bad blocks in the filesystem, or requests to subtly
change the layout to optimize various RAID layouts, for example.  And
exactly how the block/inode bitmaps would get laid out in response to
-E stride has also changed over time, depending on which version of
e2fsprogs, but ---- News flash!! --- it doesn't matter!!!
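
A minimal sketch of why the exact placement does not matter, assuming the
classic 32-byte ext2/ext3 group descriptor layout: the kernel reads the
bitmap and inode table locations out of each descriptor rather than
computing them from the group number.  parse_group_desc() and the sample
block numbers are hypothetical.

```python
import struct

# Little-endian 32-byte descriptor: block bitmap, inode bitmap and inode
# table block numbers (32-bit each), then free-block, free-inode and
# used-directory counts plus a pad word (16-bit each), then 12 reserved bytes.
GROUP_DESC = struct.Struct("<IIIHHHH12x")


def parse_group_desc(raw):
    """Decode one on-disk group descriptor into a dict of its fields."""
    bb, ib, it, free_b, free_i, dirs, _pad = GROUP_DESC.unpack(raw)
    return {
        "block_bitmap": bb,
        "inode_bitmap": ib,
        "inode_table": it,
        "free_blocks": free_b,
        "free_inodes": free_i,
        "used_dirs": dirs,
    }


# Two descriptors for the same group with metadata at different blocks,
# as a stride option or a bad block might cause. Both are equally valid.
default_layout = GROUP_DESC.pack(32770, 32771, 32772, 30000, 8000, 2, 0)
shifted_layout = GROUP_DESC.pack(32775, 32776, 32777, 30000, 8000, 2, 0)
print(parse_group_desc(default_layout)["inode_table"],
      parse_group_desc(shifted_layout)["inode_table"])
```

Because every lookup goes through the descriptor, a tool is free to move
these blocks as long as it records the new locations there.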

Jeff, you seem to think that the fact that the layout isn't precisely
the same after an on-line resizing is proof of something horrible, but
it isn't.  The exact location of filesystem metadata has never been
fixed, not in the past ten years of ext2/3 history, and this is not a
big deal.  It certainly isn't "proof" of on-line resizing being
something horrible, as you keep trying to claim, without any arguments
other than, "The layout is different!".  

Oh my, hide the women and children...

							- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:23       ` Michael Poole
  2006-06-09 18:55         ` Jeff Garzik
@ 2006-06-10  0:49         ` Sven-Haegar Koch
  2006-06-10  1:06           ` Theodore Tso
  1 sibling, 1 reply; 296+ messages in thread
From: Sven-Haegar Koch @ 2006-06-10  0:49 UTC (permalink / raw)
  To: Michael Poole
  Cc: Jeff Garzik, Andrew Morton, Christoph Hellwig, cmm, linux-kernel,
	ext2-devel, linux-fsdevel

On Fri, 9 Jun 2006, Michael Poole wrote:

> Jeff Garzik writes:
>
>> Andrew Morton wrote:
>>> Ted&co have been pretty good at avoiding compatibility problems.
>>
>> Well, extents and 48bit make that track record demonstrably worse.
>>
>> Users are now forced to remember that, if they write to their
>> filesystem after using either $mmver or $korgver kernels, they are
>> locked out of using older kernels.
>
> Users are also forced to remember that, if they use certain new
> distros or programs, they are locked out of using older kernels.  They
> are forced to remember that if they have certain newer hardware, they
> are locked out of using older kernels.  They are forced to remember
> that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
> using older kernels.  Why single out this particular aspect of limited
> forward compatibility to harp on so much?

I see a different problem with "ext3 + extents is not ext3 anymore" when
the feature goes mainstream:
- user with old distri, no extents in use, no kernel support for them
- user has some kind of problem
- uses new rescue disk (aka knoppix at the time of problem) - that then
   is current stuff, and certainly uses extents - fixes problem on disk
   (may be as simple as running lilo/grub from chroot, happens often for me)
- tries to boot back into his distri -> *boom* he lost

c'ya
sven

-- 

The Internet treats censorship as a routing problem, and routes around it.
(John Gilmore on http://www.cygnus.com/~gnu/)

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  0:49         ` Sven-Haegar Koch
@ 2006-06-10  1:06           ` Theodore Tso
  2006-06-10 14:07             ` Olivier Galibert
  0 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-10  1:06 UTC (permalink / raw)
  To: Sven-Haegar Koch
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Michael Poole, Christoph Hellwig, cmm, linux-fsdevel

On Sat, Jun 10, 2006 at 02:49:32AM +0200, Sven-Haegar Koch wrote:
> I see a different problem with "ext3 + extents is not ext3 anymore" when
> the feature goes mainstream:
> - user with old distri, no extents in use, no kernel support for them
> - user has some kind of problem
> - uses new rescue disk (aka knoppix at the time of problem) - that then
>   is current stuff, and certainly uses extents - fixes problem on disk
>   (may be as simple as running lilo/grub from chroot, happens often for me)
> - tries to boot back into his distri -> *boom* he lost

Incorrect, because unless you explicitly enable the use of extents,
the mere act of using a new kernel such as might be found on knoppix
will not result in the filesystem utilizing the extent feature.

There's a lot of FUD being spread by people who haven't bothered to
understand what is being proposed, and that's disappointing.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  0:47                                     ` Theodore Tso
@ 2006-06-10  1:09                                       ` Jeff Garzik
  2006-06-10  1:30                                         ` [Ext2-devel] " Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  1:09 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Stephen C. Tweedie, Andrew Morton,
	Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
	Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas

Theodore Tso wrote:
> Jeff, you seem to think that the fact that the layout isn't precisely
> the same after an on-line resizing is proof of something horrible, but
> it isn't.  The exact location of filesystem metadata has never been
> fixed, not in the past ten years of ext2/3 history, and this is not a
> big deal.  It certainly isn't "proof" of on-line resizing being
> something horrible, as you keep trying to claim, without any arguments
> other than, "The layout is different!".  

No, I was proving merely that it is _different_.  And the values where 
you see a _difference_ are the ones of which are no longer sized 
optimally, after you grow the fs to a larger size.

So you incur a performance penalty for resizing to size S2, rather than 
mke2fs'ing the new blkdev at size S2.  Certainly within the confines of 
ext3 that cannot be helped, but a different inode allocation strategy 
could improve upon that.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:21                     ` Andreas Dilger
@ 2006-06-10  1:21                       ` Valdis.Kletnieks
  2006-06-10  2:09                         ` [Ext2-devel] " Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Valdis.Kletnieks @ 2006-06-10  1:21 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Jeff Garzik, ext2-devel, linux-kernel, Christoph Hellwig,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:

> You mount with the new kernel without "-o extents", and find files with
> extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> files, mv over old files, unmount.

How do you "copy those files" when you don't have extent support at that
point?  Remember - the whole problem here is that if you don't have
extent support, you can't read the file, it's backward-incompatible.
(If you *are* able to read the file even without extents, then this whole
thread is total BS).

You can certainly at least try to copy them to another file system
while the source *is* mounted with -o extents, and then mount without it
and copy the files back, but (a) that isn't what you said and (b) it doesn't
work for files over 2T or so..


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:09                                       ` Jeff Garzik
@ 2006-06-10  1:30                                         ` Andreas Dilger
  2006-06-10  1:43                                           ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  1:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Stephen C. Tweedie, Andrew Morton, Matthew Frost,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Jun 09, 2006  21:09 -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
> > Jeff, you seem to think that the fact that the layout isn't precisely
> > the same after an on-line resizing is proof of something horrible, but
> > it isn't.  The exact location of filesystem metadata has never been
> > fixed, not in the past ten years of ext2/3 history, and this is not a
> > big deal.  It certainly isn't "proof" of on-line resizing being
> > something horrible, as you keep trying to claim, without any arguments
> > other than, "The layout is different!".  
> 
> No, I was proving merely that it is _different_.  And the values where 
> you see a _difference_ are the ones of which are no longer sized 
> optimally, after you grow the fs to a larger size.

It sounds like you don't know what you are talking about, which is OK,
except that you keep harping on some non-existent point.

> So you incur a performance penalty for resizing to size S2, rather than 
> mke2fs'ing the new blkdev at size S2.  Certainly within the confines of 
> ext3 that cannot be helped, but a different inode allocation strategy 
> could improve upon that.

???  Can you please be specific in what the performance penalty is, and
what specifically is "not sized optimally" after a resize?  How exactly
does inode allocation strategy relate to anything at all to online resizing.

Given that Ted and I are both disagreeing with you, and we are the two
people who know the most about the online resizing code (SCT is also
in this same group), maybe you should just concede that you are incorrect
on this point and move on.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:30                                         ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10  1:43                                           ` Jeff Garzik
  2006-06-10  2:03                                             ` Theodore Tso
  2006-06-10  2:26                                             ` Andreas Dilger
  0 siblings, 2 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  1:43 UTC (permalink / raw)
  To: Jeff Garzik, Theodore Tso, Stephen C. Tweedie, Andrew Morton,
	Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
	Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas

Andreas Dilger wrote:
> On Jun 09, 2006  21:09 -0400, Jeff Garzik wrote:
>> Theodore Tso wrote:
>>> Jeff, you seem to think that the fact that the layout isn't precisely
>>> the same after an on-line resizing is proof of something horrible, but
>>> it isn't.  The exact location of filesystem metadata has never been
>>> fixed, not in the past ten years of ext2/3 history, and this is not a
>>> big deal.  It certainly isn't "proof" of on-line resizing being
>>> something horrible, as you keep trying to claim, without any arguments
>>> other than, "The layout is different!".  
>> No, I was proving merely that it is _different_.  And the values where 
>> you see a _difference_ are the ones of which are no longer sized 
>> optimally, after you grow the fs to a larger size.
> 
> It sounds like you don't know what you are talking about, which is OK,
> except that you keep harping on some non-existent point.
> 
>> So you incur a performance penalty for resizing to size S2, rather than 
>> mke2fs'ing the new blkdev at size S2.  Certainly within the confines of 
>> ext3 that cannot be helped, but a different inode allocation strategy 
>> could improve upon that.
> 
> ???  Can you please be specific in what the performance penalty is, and
> what specifically is "not sized optimally" after a resize?  How exactly
> does inode allocation strategy relate to anything at all to online resizing.

Inodes per group / inode blocks per group, as I've already stated.

	Jeff



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:43                                           ` Jeff Garzik
@ 2006-06-10  2:03                                             ` Theodore Tso
  2006-06-10  2:11                                               ` [Ext2-devel] " Jeff Garzik
  2006-06-10  2:58                                               ` [Ext2-devel] " Jeff Garzik
  2006-06-10  2:26                                             ` Andreas Dilger
  1 sibling, 2 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-10  2:03 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Fri, Jun 09, 2006 at 09:43:14PM -0400, Jeff Garzik wrote:
> >???  Can you please be specific in what the performance penalty is, and
> >what specifically is "not sized optimally" after a resize?  How exactly
> >does inode allocation strategy relate to anything at all to online 
> >resizing.
> 
> Inodes per group / inode blocks per group, as I've already stated.

Nope!  Inodes per group and inode blocks per group are maintained
across an online resize.  So there is no difference in inodes per
group for a filesystem created at size S1 and resized to size S2
(using either an on-line or off-line resize), and a filesystem which
is created to be size S2.

As Andreas has said, "you don't know what you are talking about."

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:21                       ` Valdis.Kletnieks
@ 2006-06-10  2:09                         ` Andreas Dilger
  2006-06-10  2:45                           ` Nicholas Miell
  0 siblings, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  2:09 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Alex Tomas, Jeff Garzik, ext2-devel, linux-kernel,
	Christoph Hellwig, Mingming Cao, linux-fsdevel

On Jun 09, 2006  21:21 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:
> > You mount with the new kernel without "-o extents", and find files with
> > extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> > files, mv over old files, unmount.
> 
> How do you "copy those files" when you don't have extent support at that
> point?  Remember - the whole problem here is that if you don't have
> extent support, you can't read the file, it's backward-incompatible.
> (If you *are* able to read the file even without extents, then this whole
> thread is total BS).

The "-o extents" mount option only affects new files that are created
while that option is enabled.  It doesn't affect existing files (even if
they are modified while "-o extents" is set).  It also doesn't affect any
new files after "-o extents" is removed.  Also, directories will not
be extent-mapped, because their allocation pattern doesn't mix well with
extent-mapped files (i.e. they are mostly single-block allocations).

Files that are created with "-o extents" are of course only readable with
a kernel that supports it.  To be safe, the whole filesystem is marked
with an EXT3_FEATURE_INCOMPAT_EXTENTS flag when the first extent file
is created so that users don't inadvertently get strange errors while
accessing the inodes marked with EXT3_EXTENT_FL with an old kernel.
New kernels that understand INCOMPAT_EXTENTS of course can access extent
and non-extent files equally well.
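
The gating described above can be sketched as a bitmask test: a kernel
refuses to mount a filesystem whose superblock advertises any INCOMPAT bit
the kernel does not know.  This is a minimal illustration, not the actual
mount code; the bit values are the ones I recall from the ext3/ext4 headers
and should be treated as assumptions, and can_mount() and the two *_SUPP
masks are hypothetical names.

```c
#include <stdint.h>

/* Incompat feature bits; values are illustrative assumptions. */
#define INCOMPAT_FILETYPE 0x0002u
#define INCOMPAT_RECOVER  0x0004u
#define INCOMPAT_EXTENTS  0x0040u

/* Hypothetical masks: what an "old" and a "new" kernel understand. */
#define OLD_KERNEL_SUPP (INCOMPAT_FILETYPE | INCOMPAT_RECOVER)
#define NEW_KERNEL_SUPP (OLD_KERNEL_SUPP | INCOMPAT_EXTENTS)

/* Returns 1 if a kernel supporting `supported` may mount a filesystem
 * whose superblock advertises `fs_incompat`, 0 otherwise. */
static int can_mount(uint32_t fs_incompat, uint32_t supported)
{
    return (fs_incompat & ~supported) == 0;
}
```

With these masks, a filesystem without the extents bit mounts on both
kernels, and one with the bit set is rejected by the old kernel up front
rather than failing later on an individual extent-mapped inode.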

In an emergency it would also be possible to remove the INCOMPAT_EXTENTS
filesystem flag and access all of the non-extent files, but this would
risk filesystem corruption if any of the extent files were modified or
unlinked, as that is the only indication older kernels have of this change.

So, to answer your question, if you _really_ want to get rid of extents
on a filesystem, you mount the filesystem with INCOMPAT_EXTENTS on a new
kernel that supports extents, but without -o extents so new files will
use the old block-map layout, so if "orig-file" is an extent-mapped file:

	cp /mnt/tmp/orig-file /mnt/tmp/temp-block-mapped-file
	mv /mnt/tmp/temp-block-mapped-file /mnt/tmp/orig-file

and now /mnt/tmp/orig-file is no longer extent-mapped.  Do this for all
the extent-mapped files, unmount, use "debugfs -w -R 'feature ^extents' {dev}"
and your filesystem is mountable with any old kernel.
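
For anyone who would rather script the cp/mv step than type it per file, a
rough C equivalent follows.  It is only a sketch: rewrite_file() and the
temporary path argument are hypothetical, ownership, mode and timestamps are
not preserved, and it relies on the filesystem being mounted without
"-o extents" so that the newly created copy is block-mapped.

```c
#include <stdio.h>

/* Copy `path` to `tmp_path`, then rename the copy over the original.
 * Returns 0 on success, -1 on failure.  The temporary file must live on
 * the same filesystem as the original for rename() to work. */
static int rewrite_file(const char *path, const char *tmp_path)
{
    FILE *in = fopen(path, "rb");
    if (!in)
        return -1;

    FILE *out = fopen(tmp_path, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    char buf[65536];
    size_t n;
    int ok = 1;

    /* Stream the data across; stop on the first short write. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    /* Replace the original only if the copy is complete. */
    if (!ok || rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

Calling it once per file reported by the lsattr scan mirrors the manual
procedure; the original is left untouched if the copy fails part-way.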

No, it's not quite as easy as ext3 journal recovery->ext2 mounting,
but then again "-o extents" isn't something that happens automatically
(at least not for a couple of years), and hopefully distros will be smart
enough never to do this for filesystems like /boot or / that are critical
for mounting on a wide variety of kernels.  Besides which, we don't want
to have to teach GRUB about extent-mapped files.  Conceivably, if this
becomes an issue then it should be possible to add a "no extents" flag
to inodes and parent directories that is inherited by new files that
should never be extent mapped.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:03                                             ` Theodore Tso
@ 2006-06-10  2:11                                               ` Jeff Garzik
  2006-06-10  2:54                                                 ` Theodore Tso
  2006-06-10  2:58                                               ` [Ext2-devel] " Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  2:11 UTC (permalink / raw)
  To: Theodore Tso, Jeff Garzik, Stephen C. Tweedie, Andrew Morton,
	Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
	Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas

Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 09:43:14PM -0400, Jeff Garzik wrote:
>>> ???  Can you please be specific in what the performance penalty is, and
>>> what specifically is "not sized optimally" after a resize?  How exactly
>>> does inode allocation strategy relate to anything at all to online 
>>> resizing.
>> Inodes per group / inode blocks per group, as I've already stated.
> 
> Inodes per group and inode blocks per group are maintained
> across an online resize.

That's the problem I'm pointing out.


> So there is no difference in inodes per
> group for a filesystem created at size S1 and resized to size S2
> (using either an on-line or off-line resize), and a filesystem which
> is created to be size S2.

Trivial to prove false, by your statement above if nothing else.  But 
anyway:
Run mke2fs on a blkdev of size 500MB, and one of 500GB.  Note values.
Now resize blkdev formatted for size 500MB to 500GB, and note differences.

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:43                                           ` Jeff Garzik
  2006-06-10  2:03                                             ` Theodore Tso
@ 2006-06-10  2:26                                             ` Andreas Dilger
  2006-06-10  2:31                                               ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  2:26 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Theodore Tso, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Jun 09, 2006  21:43 -0400, Jeff Garzik wrote:
> >???  Can you please be specific in what the performance penalty is, and
> >what specifically is "not sized optimally" after a resize?  How exactly
> >does inode allocation strategy relate to anything at all to online 
> >resizing.
> 
> Inodes per group / inode blocks per group, as I've already stated.

As Stephen and Ted already replied (though I can understand if you missed
it, it seems this is a popular thread :-), the inode count per group
is a fixed parameter for the whole filesystem that even online resizing
cannot change.

The only things that can change on a per-group basis (with either online or
offline resizing, or with mke2fs -R stride=N, or if there are bad blocks
on disk) are the relative offset within the group of the inode and
block bitmaps, and the relative location of the inode table within the
group.  The size of the inode table per group (and hence number of inodes
per group) is always constant, since it is stored in the superblock and
affects the inode number->group mapping.
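
To make the inode number->group mapping concrete, here is a minimal sketch
of the arithmetic ext2/ext3 uses (inode numbers start at 1, groups at 0).
The function names are hypothetical; only the formula is the point.

```c
#include <stdint.h>

/* Block group that owns inode number `ino`. */
static uint32_t inode_group(uint32_t ino, uint32_t inodes_per_group)
{
    return (ino - 1) / inodes_per_group;
}

/* Slot of inode `ino` within its group's inode table. */
static uint32_t inode_index(uint32_t ino, uint32_t inodes_per_group)
{
    return (ino - 1) % inodes_per_group;
}
```

Since both results depend on inodes_per_group, changing that value on an
existing filesystem would move every existing inode number to a different
group or slot, which is why it has to stay fixed across a resize.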

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:26                                             ` Andreas Dilger
@ 2006-06-10  2:31                                               ` Jeff Garzik
  2006-06-10  4:22                                                 ` Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  2:31 UTC (permalink / raw)
  To: Jeff Garzik, Theodore Tso, Stephen C. Tweedie, Andrew Morton,
	Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
	Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas

Andreas Dilger wrote:
> the inode count per group
> is a fixed parameter for the whole filesystem that even online resizing
> cannot change.

Correct.  Fixed... at mke2fs time.  Thus, with varying mke2fs runs, 
inodes-per-group can vary, where it does not with online resize.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:09                         ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10  2:45                           ` Nicholas Miell
  2006-06-10  4:29                             ` Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Nicholas Miell @ 2006-06-10  2:45 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Valdis.Kletnieks, Alex Tomas, Jeff Garzik, ext2-devel,
	linux-kernel, Christoph Hellwig, Mingming Cao, linux-fsdevel

On Fri, 2006-06-09 at 20:09 -0600, Andreas Dilger wrote:
> On Jun 09, 2006  21:21 -0400, Valdis.Kletnieks@vt.edu wrote:
> > On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:
> > > You mount with the new kernel without "-o extents", and find files with
> > > extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> > > files, mv over old files, unmount.
> > 
> > How do you "copy those files" when you don't have extent support at that
> > point?  Remember - the whole problem here is that if you don't have
> > extent support, you can't read the file, it's backward-incompatible.
> > (If you *are* able to read the file even without extents, then this whole
> > thread is total BS).
> 
> The "-o extents" mount option only affects new files that are created
> while that option is enabled.  It doesn't affect existing files (even if
> they are modified while "-o extents" is set).  It also doesn't affect any
> new files after "-o extents" is removed.  Also, directories will not
> be extent-mapped, because their allocation pattern doesn't mix well with
> extent-mapped files (i.e. they are mostly single-block allocations).
> 
> Files that are created with "-o extents" are of course only readable with
> a kernel that supports it.  To be safe, the whole filesystem is marked
> with an EXT3_FEATURE_INCOMPAT_EXTENTS flag when the first extent file
> is created so that users don't inadvertently get strange errors while
> accessing the inodes marked with EXT3_EXTENT_FL with an old kernel.
> New kernels that understand INCOMPAT_EXTENTS of course can access extent
> and non-extent files equally well.
> 
> In an emergency it would also be possible to remove the INCOMPAT_EXTENTS
> filesystem flag and access all of the non-extent files, but this would
> risk filesystem corruption if any of the extent files were modified or
> unlinked, as that is the only indication older kernels have of this change.
> 
> So, to answer your question, if you _really_ want to get rid of extents
> on a filesystem, you mount the filesystem with INCOMPAT_EXTENTS on a new
> kernel that supports extents, but without -o extents so new files will
> use the old block-map layout, so if "orig-file" is an extent-mapped file:
> 
> 	cp /mnt/tmp/orig-file /mnt/tmp/temp-block-mapped-file
> 	mv /mnt/tmp/temp-block-mapped-file /mnt/tmp/orig-file
> 
> and now /mnt/tmp/orig-file is no longer extent-mapped.  Do this for all
> the extent-mapped files, unmount, use "debugfs -w -R 'feature ^extents' {dev}"
> and your filesystem is mountable with any old kernel.
> 
> No, it's not quite as easy as ext3 journal recovery->ext2 mounting,
> but then again "-o extents" isn't something that happens automatically
> (at least not for a couple of years), and hopefully distros will be smart
> enough never to do this for filesystems like /boot or / that are critical
> for mounting on a wide variety of kernels.  Besides which, we don't want
> to have to teach GRUB about extent-mapped files.  Conceivably, if this
> becomes an issue then it should be possible to add a "no extents" flag
> to inodes and parent directories that is inherited by new files that
> should never be extent mapped.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.


I think changing all of this mess to:

[root@localhost root]# tune2fs -O extents /dev/whatever
WARNING: Enabling extents on /dev/whatever will make this filesystem
unreadable in Linux kernels versions before 2.6.19!
Are you sure you want to do this? <y/n>

[root@localhost root]# tune2fs -O ^extents /dev/whatever
WARNING: Disabling extents on /dev/whatever requires you to run e2fsck
on this filesystem before it can be used again!
Are you sure you want to do this? <y/n>

might assuage many of the fears presented in this thread.
-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:11                                               ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10  2:54                                                 ` Theodore Tso
  2006-06-10  3:11                                                   ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-10  2:54 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Fri, Jun 09, 2006 at 10:11:59PM -0400, Jeff Garzik wrote:
> Trivial to prove false, by your statement above if nothing else.  But 
> anyway:
> Run mke2fs on a blkdev of size 500MB, and one of 500GB.  Note values.
> Now resize blkdev formatted for size 500MB to 500GB, and note differences.

OK, so *that's* what you were trying to get at.  I wish you had said
that from the first, since most people who are creating filesystems to
resize (i.e., on LVM or RAID systems), don't start them as small as
500MB.

Yes, the default inode ratio and blocksize are different for
filesystems under 512MB.  But that's largely irrelevant for the use
cases of online resizing, where people will generally be starting with
a filesystem *far* larger than 512 megs.  They might be starting with an
LVM sized to be 2 gigs and resizing it to 5 gigs.  Or 100 gigs and
resizing it to 200 gigs; or 500 gigs; or a terabyte.  In all of those
cases, the results are identical.

It also, by the way, has nothing to do with the "inode allocation
algorithm", as you claimed.  The biggest difference will come from the
use of a 1k blocksize instead of 4k blocksize, but that's a matter of
the defaults that were selected for "small" filesystems.  If someone
was creating a file system that they knew they were likely to resize
to 500GB, they could always create it with an explicitly specified
blocksize of 4k, and also specify a different inode ratio.

And this is your argument that on-line resizing is a horrible hack,
and ext3 should be thrown out and rewritten from scratch?  That's
weak.   

One other thought --- people do *care* about backwards compatibility
from a filesystem format level, and they do appreciate being able to
easily upgrade and take advantage of new filesystem features without
needing to do a dump/restore.  

If you don't care about compatibility, but want a scalable filesystem,
take a look at JFS.  It's very, very, good at what it does (and has
support for extents and large block numbers) --- and it's smaller than
XFS and doesn't have the VNODE and System V/IRIX API compatibility
crud of XFS.  The only downside with it is that you do have to do a
backup, reformat, and restore, and of course, the lack of support from
pretty much all of the major distributions.

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:03                                             ` Theodore Tso
  2006-06-10  2:11                                               ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10  2:58                                               ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  2:58 UTC (permalink / raw)
  To: Theodore Tso, Andreas Dilger, Stephen C. Tweedie
  Cc: Andrew Morton, Matthew Frost, ext2-devel@lists.sourceforge.net,
	linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
	Alex Tomas

Theodore Tso wrote:
> Inodes per group and inode blocks per group are maintained
> across an online resize.  So there is no difference in inodes per
> group for a filesystem created at size S1 and resized to size S2
> (using either an on-line or off-line resize), and a filesystem which
> is created to be size S2.


Here are real numbers, which illustrate how the above two statements 
contradict, and how the second statement is false:

blkdev A, formatted with a 50MB filesystem
	block size		4096
	block count		12800 (size S1)
	inodes per group	12800
blkdev A, formatted to full capacity (~350GB)
	block size		4096
	block count		95472256 (size S2)
	inodes per group	32768

Case 1:	online resize from 50MB to 350GB
Result:	inodes per group == 12800 (it remains the same)

Case 2: mke2fs blkdev A, with no block-count restrictions
Result:	inodes per group == 32768

Thus, each block group holds fewer inodes in case #1 than in case #2.
Thus, case #2 has greater inode density than case #1.

Overall,
a) mke2fs chooses optimal values based on creation-time block count
b) online resize does not change these values

thus the values are no longer optimal.  And in this case, they are never 
-more- optimal, and potentially -less- optimal.
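
The figures above can be turned into totals with some rough arithmetic.
This is a back-of-the-envelope sketch, assuming each block group spans
8 * block_size blocks (one bitmap block's worth) and ignoring details such
as a short last group; the function names are hypothetical.

```c
#include <stdint.h>

/* One bitmap block has 8 * block_size bits, one per block in the group. */
static uint64_t blocks_per_group(uint64_t block_size)
{
    return 8 * block_size;
}

/* Number of groups needed to cover the filesystem, rounding up. */
static uint64_t group_count(uint64_t block_count, uint64_t block_size)
{
    uint64_t bpg = blocks_per_group(block_size);
    return (block_count + bpg - 1) / bpg;
}

/* Total inodes once every group has an inode table of the given size. */
static uint64_t total_inodes(uint64_t block_count, uint64_t block_size,
                             uint64_t inodes_per_group)
{
    return group_count(block_count, block_size) * inodes_per_group;
}

/* Bytes of a full group per inode (integer division). */
static uint64_t bytes_per_inode(uint64_t block_size, uint64_t inodes_per_group)
{
    return blocks_per_group(block_size) * block_size / inodes_per_group;
}
```

With block size 4096 and block count 95472256 this gives 2914 groups, so
about 37.3 million inodes at 12800 per group (case 1) against about 95.5
million at 32768 per group (case 2), i.e. roughly one inode per 10 KB
instead of one per 4 KB.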

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:54                                                 ` Theodore Tso
@ 2006-06-10  3:11                                                   ` Jeff Garzik
  2006-06-10 12:15                                                     ` Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10  3:11 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

Theodore Tso wrote:
> And this is your argument that on-line resizing is a horrible hack,

It's an example of ext2 being bandaided to do something it was never 
originally designed to do.  If online resizing had been planned from the 
start, allocating new inode tables on the fly would be trivial, as it is 
in JFS/NTFS/...


> and ext3 should be thrown out and rewritten from scratch? 

Blatant and silly exaggeration.  Re-read the thread, and note how many 
times "cp -a ext3 ext4" was written.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
  2006-06-09 15:31         ` Matthew Wilcox
@ 2006-06-10  3:26           ` Valerie Henson
  2006-06-10  5:25             ` Andreas Dilger
  2006-06-10 14:22             ` Jeff Garzik
  0 siblings, 2 replies; 296+ messages in thread
From: Valerie Henson @ 2006-06-10  3:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jeff Garzik, Arjan van de Ven, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

On Fri, Jun 09, 2006 at 09:31:16AM -0600, Matthew Wilcox wrote:
> 
> I want extents, but I'm still unconvinced that ext3 needs to grow beyond
> 32-bit blocks.  The scheme posted by Val and Arjan (with the
> continuation inodes) seems much neater.

Well, thanks!  Arjan and I like our idea too, but at this point it's
just an idea.  We'll be hashing it out some more at the file system
workshop next week.

To be honest, continuation inodes and these ext3 patches are
addressing different problems.  ext3 48-bit extents are an advanced
solution to a complex problem - growing ext3 beyond 8TB while keeping
as much of the existing on-disk format and associated stable code as
possible.  It's hard work and the ext3 developers came up with some
good ideas.  Continuation inodes are an idea about how to limit error
propagation in large file systems - an idea which happens to allow
file systems larger than 8 TB with 32-bit block pointers.

So what the heck are continuation inodes?  Actually, we named this
"chunkfs" - not particularly descriptive, maybe continuation inodes is
a better term.

Continuation inodes/chunkfs are an idea Arjan and I came up with,
inspired loosely by the ext2 dirty bit code.  The problem we were
trying to solve is how to isolate the effects of file system
corruption (from crash, bug, or I/O error) so that we didn't have to
run fsck over the entire file system in order to repair it.  This is
important because disk bandwidth is not growing as fast as disk
capacity, so the absolute time to read the entire disk is growing.
The basic idea is to create a bunch of small file systems - chunks -
which look like one big file system to the administrator.  Major
problems to solve:

1. Files which span more than one chunk (file system).
2. Hard links from a directory in chunk A to a file in chunk B.

The solution we came up with is to create a "continuation inode" in
every file system chunk which contains data for a particular file or
directory.  For example, if file "foo" has its inode in chunk A, and
some file data in chunk B, we would create a continuation inode in
chunk B.  The continuation inode has a back pointer to the parent
inode.  Now imagine there is some kind of corruption in chunk B and we
need to check the file system.  We can determine the free or allocated
state of every block in chunk B without reading any metadata outside
of chunk B.
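
A minimal sketch of that chunk-local check, under stated assumptions: the
struct layout, field names and function name below are hypothetical (nothing
here is an existing on-disk format), and block numbers are taken to be local
to the chunk.  It rebuilds the allocation map of one chunk by walking only
the inodes and continuation inodes stored in that chunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of an inode or continuation inode in one chunk. */
struct chunk_inode {
	uint32_t ino;            /* inode number, local to this chunk */
	uint32_t parent_chunk;   /* back pointer: chunk holding the parent inode */
	uint32_t parent_ino;     /* back pointer: parent inode number, 0 if none */
	const uint32_t *blocks;  /* chunk-local block numbers owned by this inode */
	size_t nblocks;
};

/*
 * Rebuild the allocation map of a single chunk using only metadata that
 * lives inside that chunk.  Returns false if a block number is out of
 * range or claimed by more than one inode.
 */
bool chunk_rebuild_bitmap(const struct chunk_inode *inodes, size_t ninodes,
			  bool *in_use, size_t chunk_blocks)
{
	for (size_t b = 0; b < chunk_blocks; b++)
		in_use[b] = false;

	for (size_t i = 0; i < ninodes; i++) {
		for (size_t j = 0; j < inodes[i].nblocks; j++) {
			uint32_t blk = inodes[i].blocks[j];

			if (blk >= chunk_blocks || in_use[blk])
				return false;	/* corrupt or doubly-claimed block */
			in_use[blk] = true;
		}
	}
	return true;
}
```

The back pointer is not consulted here at all; it would only matter for the
(cheaper) cross-chunk checks.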

Similarly, if we create a hard link to file "foo" in chunk A from
directory "bar" in chunk B, we will allocate a continuation inode for
directory "bar" in chunk B, and then allocate a block to contain the
link to "foo" in chunk B.  Once again, to find the link count of every
inode in chunk B, we only have to look at directories inside of chunk
B.  There are still problems that require checking across chunks, but
we only need to read inodes and directory entries in those cases and
the checks are much simpler than in existing fsck.

One interesting possibility would be to combine this with the ext2
dirty bit patches.  They create a clean/dirty bit for an ext2 file
system.  If the system crashes while the file system is being written
to, the bit is set to dirty and we do a full fsck.  If the system
crashes while it's inactive, the bit is clean, and all we have to do
is a little bit of orphan inode cleanup before mounting.  If we
implement chunkfs on top of this, we could get away with fsck'ing only
a few of the file systems each time, getting ext2-style performance
with ext3-style fast recovery.
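
Roughly, and only as an illustration of the idea (the per-chunk dirty flag
array and the function name are assumptions, not part of the dirty bit
patches), picking the chunks to check after a crash could look like this:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Collect the indices of chunks whose dirty flag was set at crash time.
 * 'out' must have room for nchunks entries.  Returns how many chunks
 * need a full check; the rest only need orphan inode cleanup.
 */
size_t chunks_to_check(const bool *dirty, size_t nchunks, size_t *out)
{
	size_t n = 0;

	for (size_t i = 0; i < nchunks; i++) {
		if (dirty[i])
			out[n++] = i;
	}
	return n;
}
```

The fsck cost then scales with the number of dirty chunks rather than with
the size of the whole file system.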

I measured the number of different block groups that were
simultaneously dirty on my laptop's file system as a proxy for how
many chunks would be dirty; it turns out that on average most block
groups were clean 98% of the time, and when I really pushed my
(admittedly dinky) disk I/O system with an artificial load, only a
maximum of 25% of the block groups were dirty during any one second
period.  So it's tempting... We'll talk about it more next week, I
hope.

-VAL

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 23:11                 ` Andreas Dilger
  2006-06-09 23:15                   ` Jeff Garzik
@ 2006-06-10  3:37                   ` Valerie Henson
  1 sibling, 0 replies; 296+ messages in thread
From: Valerie Henson @ 2006-06-10  3:37 UTC (permalink / raw)
  To: Andrew Morton, Sonny Rao, jeff, hch, cmm, linux-kernel,
	ext2-devel, linux-fsdevel

On Fri, Jun 09, 2006 at 05:11:52PM -0600, Andreas Dilger wrote:
> On Jun 09, 2006  15:15 -0700, Andrew Morton wrote:
> > 
> > We're continuing to nurse along a few basically-15-year-old filesystems
> > while we do have the brains, manpower and processes to implement a new,
> > really great one.
> > 
> > It's just this feeling I have ;)
> 
> I think many people share this feeling (me included), hence the linux
> filesystem meeting next week...  The problem is that even getting a
> half-decent disk filesystem is many years of work, and large disks are
> here before then.  The ZFS code took 10 years to get to its current state,
> I understand, so I don't anticipate we will get there overnight.

I helped bring up the first instance of ZFS running as a kernel module
on Halloween, 2002 (one fun week staying up all night hacking with
Jeff Bonwick).  The earliest code was written in either 2001 or just
possibly 2000 - so 5-6 years in elapsed time.  On the other hand, in
terms of total programmer staff-years put into ZFS, it's on the order
of 25 years.

I'm not sure either what the best route to the next big Linux file
system is - start from scratch or reuse a lot of code.  One of the
things I want to talk about at the workshop is creative reuse of
existing code, a la the continuation inode idea.

-VAL

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 22:15               ` Andrew Morton
  2006-06-09 23:11                 ` Andreas Dilger
@ 2006-06-10  3:49                 ` Nathan Scott
  1 sibling, 0 replies; 296+ messages in thread
From: Nathan Scott @ 2006-06-10  3:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel, linux-kernel

On Fri, Jun 09, 2006 at 03:15:53PM -0700, Andrew Morton wrote:
> Sonny Rao <sonny@burdell.org> wrote:
> > On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
> > <snip> 
> > > All that being said, Linux's filesystems are looking increasingly crufty
> > > and we are getting to the time where we would benefit from a greenfield
> > > start-a-new-one.  
> > 
> > I'm curious about this comment; in what way are they _collectively_
> > looking crufty ? 
> 
> We seem to be lagging behind "the industry" in some areas - handling large
> devices, high bandwidth IO, sophisticated on-disk data structures, advanced
> manageability, etc.

Er, no.  I'm not aware of many filesystems that are in the same
league as XFS on those first three specific points.  It certainly
has "ondisk sophistication" very well covered, trust me. ;)

We are definitely not lagging on handling large devices nor high
bandwidth I/O anyway - XFS serves up very close to the hardware
capabilities for high end hardware and it scales well.  One could
come up with a different list of areas where Linux filesystems
might be lagging, but that list above ain't right.

> I mean, although ZFS is a rampant layering violation and we can do a lot of
> the things in there (without doing it all in the fs!) I don't think we can
> do all of it.

*nod*.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:31                                               ` Jeff Garzik
@ 2006-06-10  4:22                                                 ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  4:22 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Theodore Tso, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Jun 09, 2006  22:31 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >the inode count per group
> >is a fixed parameter for the whole filesystem that even online resizing
> >cannot change.
> 
> Correct.  Fixed... at mke2fs time.  Thus, with varying mke2fs runs, 
> inodes-per-group can vary, where it does not with online resize.

Unless specified differently at format time, the inodes-per-group will
be the same value (namely 16384) if the filesystem is larger than 512MB.
So, yes, I agree with you if you start with a tiny filesystem and try
to resize it to a gigantic filesystem you will get a different number
of inodes, but that is true whether this is online resizing or offline.
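
As a back-of-the-envelope sketch of why that is (not code from e2fsprogs or
the resize path; the function names are made up, and details such as a
dropped undersized last group are ignored): the total inode count is just the
number of block groups times the fixed inodes-per-group value, so growing the
filesystem only adds groups.

```c
#include <stdint.h>

/* Number of block groups needed to cover block_count blocks. */
uint64_t group_count(uint64_t block_count, uint32_t blocks_per_group)
{
	return (block_count + blocks_per_group - 1) / blocks_per_group;
}

/*
 * Total inodes in the filesystem.  inodes_per_group is chosen at mke2fs
 * time and stays the same after a resize; only the group count changes.
 */
uint64_t total_inodes(uint64_t block_count, uint32_t blocks_per_group,
		      uint32_t inodes_per_group)
{
	return group_count(block_count, blocks_per_group) * inodes_per_group;
}
```

With 4096-byte blocks a group normally spans 8 * 4096 = 32768 blocks, since
one bitmap block covers that many.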

That said, for anyone who has resized their filesystem I think they prefer
to be able to resize it than not being able to do so at all.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10  2:45                           ` Nicholas Miell
@ 2006-06-10  4:29                             ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  4:29 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Valdis.Kletnieks, Alex Tomas, Jeff Garzik, ext2-devel,
	linux-kernel, Christoph Hellwig, Mingming Cao, linux-fsdevel

On Jun 09, 2006  19:45 -0700, Nicholas Miell wrote:
> I think changing all of this mess to:
> 
> [root@localhost root]# tune2fs -O extents /dev/whatever
> WARNING: Enabling extents on /dev/whatever will make this filesystem
> unreadable in Linux kernel versions before 2.6.19!
> Are you sure you want to do this? <y/n>
> 
> [root@localhost root]# tune2fs -O ^extents /dev/whatever
> WARNING: Disabling extents on /dev/whatever requires you to run e2fsck
> on this filesystem before it can be used again!
> Are you sure you want to do this? <y/n>
> 
> might assuage many of the fears presented in this thread.

If that were true, then I'd be happy to make this the barrier to entry.
Sadly, I don't think that is the only issue, but I'm happy to be shown
to be wrong.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
  2006-06-10  3:26           ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
@ 2006-06-10  5:25             ` Andreas Dilger
  2006-06-10  5:41               ` Valerie Henson
  2006-06-10 14:22             ` Jeff Garzik
  1 sibling, 1 reply; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  5:25 UTC (permalink / raw)
  To: Valerie Henson
  Cc: Andrew Morton, Jeff Garzik, Matthew Wilcox, Arjan van de Ven,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
	Alex Tomas

On Jun 09, 2006  20:26 -0700, Valerie Henson wrote:
> To be honest, continuation inodes and these ext3 patches are
> addressing different problems.  ext3 48-bit extents are an advanced
> solution to a complex problem - growing ext3 beyond 8TB while keeping
> as much of the existing on-disk format and associated stable code as
> possible.

The 48-bit support was actually only a small part of the original reason for
extents, while it seems to be the most popular right now.  The other
issues that are being addressed are:
- performance issues like avoiding 0.1%+ indirect block metadata overhead
  for each file (which is bad for the cache, and also hurts unlinks)
- the extent index blocks are also more robust than indirect blocks (they
  have a magic and internally verifiable structure, and the possibility
  to easily add metadata checksums and extent->inode backpointers to
  allow improved filesystem checking).  With large ext3 filesystems the
  {d,t,}indirect blocks can have random garbage in them and there is no
  way for the kernel to know unless it overlaps with other fixed metadata
- the ability to do things like preallocation of files efficiently (via
  uninitialized extents), instead of zero-filling the whole file.
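
To make the 0.1% figure concrete, here is a rough sketch (not the ext3
allocator itself; the function name is made up) that counts the indirect,
double-indirect and triple-indirect blocks a block-mapped file needs.  It
assumes the classic ext2/3 layout of 12 direct pointers in the inode and
4-byte block pointers, so a 4KB block holds 1024 pointers.

```c
#include <stdint.h>

#define EXT3_NDIR_BLOCKS 12	/* direct block pointers held in the inode */

static uint64_t div_round_up(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/*
 * Number of mapping (indirect) blocks needed for a file with
 * 'data_blocks' data blocks, where each indirect block holds
 * 'ppb' block pointers.  Assumes the file fits within the
 * triple-indirect limit.
 */
uint64_t indirect_blocks_needed(uint64_t data_blocks, uint64_t ppb)
{
	uint64_t meta = 0, left = data_blocks, n;

	if (left <= EXT3_NDIR_BLOCKS)
		return 0;
	left -= EXT3_NDIR_BLOCKS;

	/* single indirect: one block maps up to ppb data blocks */
	n = left < ppb ? left : ppb;
	meta += 1;
	left -= n;
	if (left == 0)
		return meta;

	/* double indirect: one top block plus one indirect per ppb blocks */
	n = left < ppb * ppb ? left : ppb * ppb;
	meta += 1 + div_round_up(n, ppb);
	left -= n;
	if (left == 0)
		return meta;

	/* triple indirect: top block, double-indirect and indirect blocks */
	n = left < ppb * ppb * ppb ? left : ppb * ppb * ppb;
	meta += 1 + div_round_up(n, ppb * ppb) + div_round_up(n, ppb);
	return meta;
}
```

For a 1GB file with 4KB blocks (262144 data blocks) this gives 257 mapping
blocks, or roughly 0.1% of the file's size, all of which must be read to map
the file and freed again on unlink.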

> Continuation inodes/chunkfs are an idea Arjan and I came up with,
> inspired loosely by the ext2 dirty bit code.  The problem we were
> trying to solve is how to isolate the effects of file system
> corruption (from crash, bug, or I/O error) so that we didn't have to
> run fsck over the entire file system in order to repair it.

I think this is a great idea, and one that is very similar to what
we are doing with ext3 filesystems in Lustre.  There is definitely
a desire to harden the ext3 code in many ways against such failures,
and being able to check independent parts of the filesystem is a
very desirable part of this.

> The solution we came up with is to create a "continuation inode" in
> every file system chunk which contains data for a particular file or
> directory.  For example, if file "foo" has its inode in chunk A, and
> some file data in chunk B, we would create a continuation inode in
> chunk B.  The continuation inode has a back pointer to the parent
> inode.

This needs some extra data in the directory entry, which I've already
been thinking about for ext3, so if you are looking at implementing
this for ext3 I'd be happy to share some ideas.

> One interesting possibility would be to combine this with the ext2
> dirty bit patches.

Put on your asbestos vest before suggesting any changes to ext2 :-).

> If we implement chunkfs on top of this, we could get away with fsck'ing
> only a few of the file systems each time, getting ext2-style performance
> with ext3-style fast recovery.

While fast recovery is one aspect of ext3 journaling, the other one
is that this allows multiple filesystem changes to be made atomically
and they are rolled back as a set if the system crashes in the middle.

> We'll talk about it more next week, I hope.

I look forward to it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
  2006-06-10  5:25             ` Andreas Dilger
@ 2006-06-10  5:41               ` Valerie Henson
  2006-06-10  6:22                 ` Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Valerie Henson @ 2006-06-10  5:41 UTC (permalink / raw)
  To: Matthew Wilcox, Alex Tomas, Andrew Morton, Jeff Garzik,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
	Arjan van de Ven

On Fri, Jun 09, 2006 at 11:25:02PM -0600, Andreas Dilger wrote:
> 
> The 48-bit support was actually only a small part of the original reason for
> extents, while it seems to be the most popular right now.

*nod*

> This needs some extra data in the directory entry, which I've already
> been thinking about for ext3, so if you are looking at implementing
> this for ext3 I'd be happy to share some ideas.

Actually, it seems vaguely possible this could be implemented as a
layer on top of any normal file system - just use files to store
continuation inodes and the like.  Then you could use the file system
that best suits your workload underneath. (Suparna has a paper in the
next OLS talking about something related but not identical, check it
out.) Most likely it would be criminally wasteful of space and really
slow, but it's something to think about.

> > One interesting possibility would be to combine this with the ext2
> > dirty bit patches.
> 
> Put on your asbestos vest before suggesting any changes to ext2 :-).

*laugh* What about ext2.5? :) Seriously, ext2 needs to be left alone,
but I'm open to the possibility that any of the existing file system
code bases could be forked off into a development file system.  Some
ideas would be more compatible with some code bases than others, and
forking might get rid of some constraints - e.g., an XFS fork could
get rid of a lot of crufty compat code.

-VAL

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
  2006-06-10  5:41               ` Valerie Henson
@ 2006-06-10  6:22                 ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10  6:22 UTC (permalink / raw)
  To: Valerie Henson
  Cc: Andrew Morton, Jeff Garzik, Matthew Wilcox, Arjan van de Ven,
	ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
	Alex Tomas

On Jun 09, 2006  22:41 -0700, Valerie Henson wrote:
> On Fri, Jun 09, 2006 at 11:25:02PM -0600, Andreas Dilger wrote:
> > This needs some extra data in the directory entry, which I've already
> > been thinking about for ext3, so if you are looking at implementing
> > this for ext3 I'd be happy to share some ideas.
> 
> Actually, it seems vaguely possible this could be implemented as a
> layer on top of any normal file system - just use files to store
> continuation inodes and the like.  Then you could use the file system
> that best suits your workload underneath.

That is basically Lustre.  One filesystem (the metadata filesystem, MDS)
holds just the pathnames and some EA data that points to other files
(these are essentially "file continuation inodes").  The data filesystems
(object storage filesystems, OST) have the file data RAID0 striped over
multiple OST "objects".  The objects are just regular files stored in
ext3 filesystems.
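
For illustration only, here is the generic RAID0 arithmetic behind such
striping.  This is not Lustre code; the struct, the function name and the
parameters stripe_size and stripe_count are assumptions made for the sketch.

```c
#include <stdint.h>

/* Where a byte of the file lands: which object, and where inside it. */
struct stripe_loc {
	uint32_t object;	/* index of the object holding the byte */
	uint64_t offset;	/* byte offset inside that object */
};

/*
 * Map a file offset to (object, offset) for a file striped round-robin
 * over stripe_count objects in units of stripe_size bytes.
 */
struct stripe_loc stripe_map(uint64_t file_off, uint64_t stripe_size,
			     uint32_t stripe_count)
{
	uint64_t stripe_no = file_off / stripe_size;
	struct stripe_loc loc;

	loc.object = (uint32_t)(stripe_no % stripe_count);
	loc.offset = (stripe_no / stripe_count) * stripe_size +
		     file_off % stripe_size;
	return loc;
}
```

Each object is then an ordinary file whose size is only a fraction of the
striped file's size.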

In clustered metadata Lustre (CMD) there are also continuation inodes for
files in a single directory, but currently a 2TB MDS filesystem is plenty
big for holding just filenames and inodes.

The same problems exist with Lustre that you have to face with the
continuation inode scheme - files that grow too large for a single
chunk, cross-chunk namespace links, etc.

Of course we'd be thrilled if there was a desire to implement Lustre
at a completely local-filesystem level (removing a lot of the networking
and required recovery mechanism), though it would also be desirable to
have the ability to move a filesystem from a local box to a distributed
filesystem (ala X11) without any changes.

> (Suparna has a paper in the next OLS talking about something related
> but not identical, check it out.)

Interesting, I'll have to take a look.

> forking might get rid of some constraints - e.g., an XFS fork could
> get rid of a lot of crufty compat code.

It continually amazes me that XFS even made it into the kernel as it
currently stands, because of the normally vehement objections to any
kind of abstraction of code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:39                         ` Gerrit Huizenga
  2006-06-09 19:45                           ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10 10:03                           ` Christoph Hellwig
  1 sibling, 0 replies; 296+ messages in thread
From: Christoph Hellwig @ 2006-06-10 10:03 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

On Fri, Jun 09, 2006 at 12:39:19PM -0700, Gerrit Huizenga wrote:
> > PRECISELY.  So you should stop modifying a filesystem whose design is 
> > admittedly _not_ modern!
> 
> So just how long do you think it would take to get a modern filesystem
> into the hands of real users, supported by the distros?  From community
> building, through design, development, testing, delivery?

JFS is pretty nice because it has many advanced features but still is
rather simple.  XFS has even more cool features such as a WIP parallel
fsck and is proven on the biggest filesystems on COS operating systems
out there, but as a disadvantage is hugely complex so outsiders have a
hard time getting into it.

So short term, the option I'd recommend is to start supporting XFS more
broadly, because it's the high-end filesystem that's out there today and
fills the needs people have in the next five or so years.

For the time after that we need to think about something that can scale
as well or better while being simpler.  Also we need to start thinking
about a clustered filesystem more, it might or might not make sense to
have a cluster filesystem also do the next generation local filesystem
thing.  I'd probably start designing such a next gen fs by taking jfs
and revamping it completely.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  3:11                                                   ` Jeff Garzik
@ 2006-06-10 12:15                                                     ` Theodore Tso
  2006-06-10 14:31                                                       ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-10 12:15 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

On Fri, Jun 09, 2006 at 11:11:31PM -0400, Jeff Garzik wrote:
> It's an example of ext2 being bandaided to do something it was never 
> originally designed to do.  If online resizing had been planned from the 
> start, allocating new inode tables on the fly would be trivial, as it is 
> in JFS/NTFS/...

And once again this has *nothing* to do with inode allocation, or
dynamic allocation of inode tables.  Your "performance issue" has to
do with a difference in blocksizes.  If you want ext2/3 to pass your silly
test, then upgrade to the latest e2fsprogs and install the following
/etc/mke2fs.conf:

[defaults]
	base_features = sparse_super,filetype,resize_inode,dir_index
	blocksize = 4096
	inode_ratio = 8192

[fs_types]
	small = {
		blocksize = 4096
		inode_ratio = 8192
	}
	floppy = {
		blocksize = 4096
		inode_ratio = 8192
	}

Happy now?

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:00                           ` Chase Venters
@ 2006-06-10 13:33                             ` Adrian Bunk
  0 siblings, 0 replies; 296+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:33 UTC (permalink / raw)
  To: Chase Venters
  Cc: Linus Torvalds, Alex Tomas, Andreas Dilger, Jeff Garzik,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, Jun 09, 2006 at 02:00:15PM -0500, Chase Venters wrote:
> On Fri, 9 Jun 2006, Chase Venters wrote:
> 
> >On Fri, 9 Jun 2006, Linus Torvalds wrote:
> >
> >>
> >>
> >> On Fri, 9 Jun 2006, Alex Tomas wrote:
> >>> 
> >>>  would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
> >>
> >> Let's put it this way:
> >> - have you had _any_ valid argument at all against "ext4"?
> >>
> >> Think about it. Honestly. Tell me anything that doesn't work?
> >
> >Now, granted, I really do agree with you about the whole code sharing 
> >thing. A fresh start is often just what you need. I'm just questioning if 
> >it wouldn't be better to do this fresh start immediately after going 
> >48-bit, rather than before. That way, existing users that want that extra 
> >umph can have it today.
> >
> 
> Let me clarify that I don't have a final answer or opinion for whether or 
> not 48-bit belongs in ext3 or ext4. But I'm trying to illustrate that it's 
> an important question to raise.
> 
> In Group A we have some number of users that must have 48-bit support by 
> Date B. 48-bit support could be available in ext3 by Date A, before Date 
> B. It could also be available in ext4 by Date X, along with a handful of 
> other features.
> 
> Is Date X before Date B? If it's not, is it worth telling Group A to 
> suffer for a while, or asking them to use ext4 before it's ready? These 
> are the questions I'd have to know the answers to if I were the one 
> casting a final decision.

There are many points mentioned in this discussion like:
- possibility of regressions for existing users
- time until the new code is actually stable and well-tested
- long-term maintainability

The faster availability is a point, but it's only one amongst many 
points.

And it's not that we are talking about a feature not yet available in 
Linux at all. Instead of suffering, couldn't the few people in urgent 
need of 48-bit support use JFS or XFS?

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:58             ` Gerrit Huizenga
  2006-06-09 18:25               ` [Ext2-devel] " Chase Venters
@ 2006-06-10 13:46               ` Adrian Bunk
  2006-06-10 14:42                 ` Ingo Molnar
  2006-06-13 13:34               ` [Ext2-devel] " Helge Hafting
  2 siblings, 1 reply; 296+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:46 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger

On Fri, Jun 09, 2006 at 10:58:00AM -0700, Gerrit Huizenga wrote:
> 
> On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
> > On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> > > 
> > > > Jeff's approach taken to the ridiculous would mean that we'd have
> > > ext versions 1-40 by now at least.  I don't think that helps much,
> > > either.
> > 
> > On the other hand, I _guarantee_ you that it helps that we have ext2-3, 
> > and not just ext2 (nobody even tried to keep ext1 compatible, thank the 
> > Gods).
>  
> I had originally argued for ext4 as well based on the fact that it would
> allow lots of potential cleanups & simplifications and at the same time
> would allow a break in the on disk filesystems layout.
> 
> These changes don't yet change the actual on-disk layout and that might
> be something that would be done if ext4 were a real, new filesystem.
> 
> But then how long until ext4 is used enough to be put into production?
> How much testing will it *really* get in any form?  How long before
> the people that are using 100 TB+ disk farms today (some of which are
> chopping filesystems into 2-8 GB chunks, others with 2 TB filesystems
> today) actually trust this new filesystem (most vendors don't support
> JFS today, XFS support isn't much better).

You want to get the new features into ext3 instead of creating ext4 for
getting them better tested.

Other people in this thread want to get the new features into ext3 
instead of creating ext4 telling that this won't do harm for existing 
users since users will have to explicitly enable it.

Hearing people using contrary arguments in the same discussion always 
sounds as if they don't actually know what they want to do...

> We are seeing storage needs increasing at a frightening rate.  Health
> Care folks want to store your MRI's, x-ray's, ultrasounds, etc. in high
> res digital format across your entire life in near-line format.  Terabytes
> over time per person.  Europe is already doing this pretty extensively,
> the US is following suit.  Digital media creation has huge storage needs.
> Most everything is moving to podcasts, webcasts, streaming audio & video.
> Storage is huge, and ext3 is at the current breaking point.
> 
> I'd argue that whatever we call it, we need a standard, stable, supported
> solution *soon* for large files, large filesystems, large storage systems
> in Linux.
> 
> I'd think the quickest path is to relieve the pressure now in ext3.

Why aren't JFS and XFS good enough for relieving the pressure now?

> We still haven't solved the filesystem check time problem, which is the
> next big bugaboo.  But getting large filesystems to real customers soon,
> e.g. in mainline, well tested, ready for distro support is my real goal.
>...

Other people have the "no regressions for existing ext3 users" goal.

> gerrit

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:27                 ` [Ext2-devel] " Mike Snitzer
  2006-06-09 18:54                   ` Jeff Garzik
@ 2006-06-10 13:49                   ` Adrian Bunk
  2006-06-10 13:51                     ` Christoph Hellwig
  1 sibling, 1 reply; 296+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:49 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Garzik, Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm,
	linux-kernel

On Fri, Jun 09, 2006 at 02:27:53PM -0400, Mike Snitzer wrote:
> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> >Jeff Garzik wrote:
> >> I disagree completely...  it would be an obvious win:  people who want
> >> stability get that, people who want new features get that too.
> >
> >And developers have a better outlet for their wacky developmental urges...
> 
> And no real-world near-term progress is made for production users with
> modern requirements. What you're advocating breeds instability in the
> near-term.

There's also the old-fashioned "no regressions" requirement.

You are trading near-term instability for the few users with "modern 
requirements" against possible regressions for a large userbase.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10 13:49                   ` Adrian Bunk
@ 2006-06-10 13:51                     ` Christoph Hellwig
  2006-06-10 14:54                       ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Christoph Hellwig @ 2006-06-10 13:51 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Mike Snitzer, Jeff Garzik, Andrew Morton, hch, linux-fsdevel,
	ext2-devel, cmm, linux-kernel

On Sat, Jun 10, 2006 at 03:49:46PM +0200, Adrian Bunk wrote:
> > And no real-world near-term progress is made for production users with
> > modern requirements. What you're advocating breeds instability in the
> > near-term.
> 
> There's also the old-fashioned "no regressions" requirement.
> 
> You are trading near-term instability for the few users with "modern 
> requirements" against possible regressions for a large userbase.

Alex mentioned a few times that the extents code just adds three ifs.
I'm pretty sure that will not give you any regressions in the existing
codebase.  Can we concentrate on the more useful discussion topics now?


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10  1:06           ` Theodore Tso
@ 2006-06-10 14:07             ` Olivier Galibert
  2006-06-10 19:52               ` Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Olivier Galibert @ 2006-06-10 14:07 UTC (permalink / raw)
  To: Theodore Tso, Sven-Haegar Koch, Michael Poole, Jeff Garzik,
	Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
	linux-fsdevel

On Fri, Jun 09, 2006 at 09:06:51PM -0400, Theodore Tso wrote:
> On Sat, Jun 10, 2006 at 02:49:32AM +0200, Sven-Haegar Koch wrote:
> > I see a different problem with "ext3 + extends is not ext3 anymore" when 
> > the feature goes mainstream:
> > - user with old distri, no extends in use, no kernel support for them
> > - user has some kind of problem
> > - uses new rescue disk (aka knoppix at the time of problem) - that then
> >   is current stuff, and certainly uses extents - fixes problem on disk
> >   (may be a simple as running lilo/grub from chroot, happens often for me)
> > - tries to boot back into his distri -> *boom* he lost
> 
> Incorrect, because unless you explicitly enable the use of extents,
> the mere act of using a new kernel such as might be found on knoppix
> will not result in the filesystem utilizing the extent feature.

And how shall the rescue/live CD know whether to use the feature?

  OG.


* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
  2006-06-10  3:26           ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
  2006-06-10  5:25             ` Andreas Dilger
@ 2006-06-10 14:22             ` Jeff Garzik
  1 sibling, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:22 UTC (permalink / raw)
  To: Valerie Henson
  Cc: Andrew Morton, Matthew Wilcox, Arjan van de Ven, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

Valerie Henson wrote:
> So what the heck are continuation inodes?  Actually, we named this
> "chunkfs" - not particularly descriptive, maybe continuation inodes is
> a better term.
[...]
> The basic idea is to create a bunch of small file systems - chunks -
> which look like one big file system to the administrator.  Major

Back when I was still playing with my experimental filesystem, one of 
the short-list features I was planning on implementing was the 
allocation of both metadata and data from the same underlying data 
store, essentially collections of "buckets" for data.

The data store would be a succession of progressively-smaller buckets. 
Typical bucket sizes (chosen by admin) on a single filesystem might be: 
1G, 128M, 4M, 1M, 64k, 4k.  The largest (top-most) bucket is the 
fundamental unit of allocation for the filesystem, from which all other 
metadata and data is read/allocated.

So in my example above, the 1G bucket is analogous to a single chunk in 
chunkfs, and any number of 1G buckets -- from any number of block 
devices -- may comprise a single filesystem.

New inode tables, bitmap chunks, directories, large files, etc. are all 
allocated from an "appropriate" bucket.  IMO this type of solution 
provides fsck-friendly isolation, and adds sufficient flexibility for 
doing things like delayed alloc, metadata-is-a-file, etc.
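
A minimal sketch of the "appropriate bucket" idea described above, assuming
the policy is simply "smallest bucket that still fits the request".  The
function name pick_bucket() and that policy are hypothetical; only the bucket
sizes come from the example in this mail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bucket sizes from the example above, largest first (in bytes). */
static const uint64_t bucket_sizes[] = {
    1ULL << 30,   /* 1G, the top-most bucket */
    128ULL << 20, /* 128M */
    4ULL << 20,   /* 4M */
    1ULL << 20,   /* 1M */
    64ULL << 10,  /* 64k */
    4ULL << 10,   /* 4k */
};
#define NBUCKETS (sizeof(bucket_sizes) / sizeof(bucket_sizes[0]))

/*
 * Hypothetical policy: return the smallest bucket size that still holds
 * `len` bytes, or 0 if the request is larger than the top-most bucket.
 */
static uint64_t pick_bucket(uint64_t len)
{
    uint64_t best = 0;
    size_t i;

    assert(len > 0);
    for (i = 0; i < NBUCKETS; i++) {
        if (bucket_sizes[i] < len)
            break;              /* every later bucket is smaller still */
        best = bucket_sizes[i]; /* a tighter fit than the previous one */
    }
    return best;
}
```

Under this assumed policy a 5M allocation lands in a 128M bucket, and
anything above 1G has to span several top-level buckets.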

	Jeff


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 12:15                                                     ` Theodore Tso
@ 2006-06-10 14:31                                                       ` Jeff Garzik
  0 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, Alex Tomas

Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 11:11:31PM -0400, Jeff Garzik wrote:
>> It's an example of ext2 being bandaided to do something it was never 
>> originally designed to do.  If online resizing had been planned from the 
>> start, allocating new inode tables on the fly would be trivial, as it is 
>> in JFS/NTFS/...
> 
> And once again this has *nothing* to do with inode allocation, or
> dynamic allocation of inode tables.  Your "performance issue" has to
> do with a difference in blocksizes.  If you ext2/3 to pass your silly
> test, then upgrade to the latest e2fsprogs and install the following
> /etc/mke2fs.conf:

WTF?  In none of my examples did block size ever change.  In none of my 
examples was block size ever mentioned as a factor.

Inode density was demonstrably different in the resize vs. mkfs cases.

And online resize -obviously- imposes a limit on inode density, by 
locking inodes-per-group at fs creation time.  Dynamic allocation of 
inode tables would permit dynamic sizing of inode tables based on 
current needs, rather than needs determined at fs creation time.

	Jeff


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 13:46               ` Adrian Bunk
@ 2006-06-10 14:42                 ` Ingo Molnar
  2006-06-10 15:03                   ` Jeff Garzik
                                     ` (3 more replies)
  0 siblings, 4 replies; 296+ messages in thread
From: Ingo Molnar @ 2006-06-10 14:42 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

* Adrian Bunk <bunk@stusta.de> wrote:

> > I'd argue that whatever we call it, we need a standard, stable, 
> > supported solution *soon* for large files, large filesystems, large 
> > storage systems in Linux.
> > 
> > I'd think the quickest path is to relieve the pressure now in ext3.
> 
> Why aren't JFS and XFS good enough for relieving the pressure now?

Compatibility? Upgradability? Simplicity? Supportability?

Even ignoring all those arguments, i find your "ext3/ext4 is too 
complex, use XFS or JFS" argument a bit naive. Please take a quick look 
at the linecount of the filesystems in question:

                  LOC
   ------------------
   ext2:         7492
   ext3+jbd:    22197
   ext4+jbd:    24312

   reiser3:     28857
   reiser4:     79189

   JFS:         32819

   XFS:        110718

the ext3 -> ext4 patches add +2115 lines of code (which 2115 lines solve 
the biggest performance and scaling problem ext3 currently has), which 
is 1.9% of the linecount of XFS.

Q.E.D.
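
The arithmetic behind those two figures can be checked with a few lines of C.
This is only a sketch using the line counts from the table above; the helper
names are made up for illustration.

```c
#include <assert.h>

/* Line counts taken from the table above. */
enum { LOC_EXT3_JBD = 22197, LOC_EXT4_JBD = 24312, LOC_XFS = 110718 };

/* Lines added by the ext3 -> ext4 patches. */
static int added_lines(void)
{
    return LOC_EXT4_JBD - LOC_EXT3_JBD;
}

/* Added lines as hundredths of a percent of XFS (integer math, truncated). */
static int added_vs_xfs_hundredths(void)
{
    assert(LOC_XFS > 0);
    return added_lines() * 10000 / LOC_XFS;
}
```

added_lines() gives 2115 and added_vs_xfs_hundredths() gives 191, i.e. 1.91%,
which rounds to the 1.9% quoted.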

> > We still haven't solved the filesystem check time problem, which is the
> > next big bugaboo.  But getting large fileysstems to real customers soon,
> > e.g. in mainline, well tested, ready for distro support is my real goal.
> >...
> 
> Other people have the "no regressions for existing ext3 users" goal.

frankly, i'll leave that decision to the ext3 developers and obviously, 
to distributors. Their filesystem has handled my data for 10 years, and 
they have been very conservative about their technical choices 
throughout. I trust them to not mess up this time either.

ext3 does quite a few things to stay compatible with ext2 - and frankly, 
i very much expected it to do that when i migrated my ext2 data to ext3. 
The days of "change the world in an incompatible way and dont look back" 
are gone.

	Ingo


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 13:51                     ` Christoph Hellwig
@ 2006-06-10 14:54                       ` Jeff Garzik
  2006-06-10 18:01                         ` [Ext2-devel] " Andreas Dilger
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:54 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton
  Cc: ext2-devel, linux-kernel, cmm, linux-fsdevel, Adrian Bunk

Christoph Hellwig wrote:
> On Sat, Jun 10, 2006 at 03:49:46PM +0200, Adrian Bunk wrote:
>>> And no real-world near-term progress is made for production users with
>>> modern requirements. What you're advocating breeds instability in the
>>> near-term.
>> There's also the old-fashioned "no regressions" requirement.
>>
>> You are trading near-term instability for the few users with "modern 
>> requirements" against possible regressions for a large userbase.
> 
> Alex mentioned a few times that the extents code just adds three if.
> I'm pretty sure that will not give you any regressions in the existing
> codebase.  Can we concentrate on the more useful discussion topics now?

Alex is off by an order of magnitude.  I've re-read the 13-patch series, 
and this is the result of the review:

There are _five_ "if (new) .. else .." constructs added in JBD alone.

Three added in extent map support.

Twenty-seven (27) such constructs in 48-bit physical block support.

Two more in 48-bit ACL support.

And finally, the superblock changes don't add any branches, like the 
other code does, but it does double the endian conversion work that 
-every- user must do, even if they don't use 48bit at all.

	Jeff


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:42                 ` Ingo Molnar
@ 2006-06-10 15:03                   ` Jeff Garzik
  2006-06-11  6:00                     ` Ingo Molnar
  2006-06-10 16:00                   ` Adrian Bunk
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 15:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, ext2-devel, linux-kernel, Adrian Bunk,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

Ingo Molnar wrote:
> the ext3 -> ext4 patches add +2115 lines of code (which 2115 lines solve 
> the biggest performance and scaling problem ext3 currently has), which 
> is 1.9% of the linecount of XFS.

Indeed!


> ext3 does quite a few things to stay compatible with ext2 - and frankly, 
> i very much expected it to do that when i migrated my ext2 data to ext3. 
> The days of "change the world in an incompatible way and dont look back" 
> are gone.

I agree with your point in the thread -- most users and distros don't 
change their main fs on a whim.  But I also point out that these 
extent+48bit changes _do_ change the format in an incompatible way...

	Jeff


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:42                 ` Ingo Molnar
  2006-06-10 15:03                   ` Jeff Garzik
@ 2006-06-10 16:00                   ` Adrian Bunk
  2006-06-10 16:05                   ` Christoph Hellwig
  2006-06-10 23:05                   ` Mike Galbraith
  3 siblings, 0 replies; 296+ messages in thread
From: Adrian Bunk @ 2006-06-10 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

On Sat, Jun 10, 2006 at 04:42:28PM +0200, Ingo Molnar wrote:
> 
> * Adrian Bunk <bunk@stusta.de> wrote:
> 
> > > I'd argue that whatever we call it, we need a standard, stable, 
> > > supported solution *soon* for large files, large filesystems, large 
> > > storage systems in Linux.
> > > 
> > > I'd think the quickest path is to relieve the pressure now in ext3.
> > 
> > Why aren't JFS and XFS good enough for relieving the pressure now?
> 
> Compatibility? Upgradability? Simplicity? Supportability?
> 
> Even ignoring all those arguments, i find your "ext3/ext4 is too 
> complex, use XFS or JFS" argument a bit naive. Please take a quick look 
> at the linecount of the filesystems in question:
>...

You missed my point (or I didn't make it clear enough):

It's no question that an improved version of ext3 will be available.
The only question is whether it will be ext3 or ext4.

My point was that if it takes a bit longer in the ext4 case, and during 
this time some people have this pressure of requiring it, they have the 
workaround of using other file systems.

Whether the "improve ext3" or the ext4 approach is better is a 
different question. Whether ext3 is better than XFS is also not what I 
was talking about.

It's simply that for the few people who need it now, other file systems 
are available as a workaround.

> 	Ingo

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:42                 ` Ingo Molnar
  2006-06-10 15:03                   ` Jeff Garzik
  2006-06-10 16:00                   ` Adrian Bunk
@ 2006-06-10 16:05                   ` Christoph Hellwig
  2006-06-10 23:05                   ` Mike Galbraith
  3 siblings, 0 replies; 296+ messages in thread
From: Christoph Hellwig @ 2006-06-10 16:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, Adrian Bunk,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

On Sat, Jun 10, 2006 at 04:42:28PM +0200, Ingo Molnar wrote:
> Even ignoring all those arguments, i find your "ext3/ext4 is too 
> complex, use XFS or JFS" argument a bit naive. Please take a quick look 
> at the linecount of the filesystems in question:

That isn't interesting at all.  There are a lot more interesting features
in jfs and xfs.  XFS is still quite bloated even compared to its features,
but it's doing much more than just an ext3+extents.  At a smaller scale
that's true for jfs as well.

As mentioned a few times below, just getting over the 8TB barrier is far
from enough for the next gen linux filesystems.  XFS already goes on to
address the Petabyte barrier.  It's not like it couldn't address Petabytes
of storage from the very beginning, but you have such problems as needing
a parallel fsck, fault tolerance, lots of parallelism in the filesystem
and things like delayed allocations to hit line rate on dozens of FC HBAs
in the system.

And no, I don't want to bitch about ext3, it's doing good work for me
on most of my machines, but it's definitely not what I would want to
scale to really large filesystems.  It's UFS done right, but the time
of UFS derivatives is slowly passing.


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:54                       ` Jeff Garzik
@ 2006-06-10 18:01                         ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-10 18:01 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Christoph Hellwig, Andrew Morton, Adrian Bunk, Mike Snitzer,
	linux-fsdevel, ext2-devel, cmm, linux-kernel

On Jun 10, 2006  10:54 -0400, Jeff Garzik wrote:
> Christoph Hellwig wrote:
> >Alex mentioned a few times that the extents code just adds three if.
> >I'm pretty sure that will not give you any regressions in the existing
> >codebase.  Can we concentrate on the more useful discussion topics now?
> 
> Alex is off by an order of magnitude.  I've re-read the 13-patch series, 
> and this is the result of the review:

Thanks for at least looking at the code, which was the intention of posting
the patches...  It caused quite a few more ruffled feathers than we expected.

> Three added in extent map support.

As Christoph quoted Alex, "the extents code", which you confirm is 3 "ifs".

> There are _five_ "if (new) .. else .." constructs added in JBD alone.

Actually, 64-bit support in the JBD code was written by Zach Brown
for OCFS, so I think they want this patch into the kernel regardless.
It's a relatively simple change though - all conditional on a single flag.

> Twenty-seven (27) such constructs in 48-bit physical block support.

Though there are really only 2 conditionals (in macros, one for read and
one for write) that are used everywhere, so it's not as bad as it seems.

> Two more in 48-bit ACL support.
> 
> And finally, the superblock changes don't add any branches, like the 
> other code does, but it does double the endian conversion work that 
> -every- user must do, even if they don't use 48bit at all.

These are all related to 48-bit filesystem support, not strictly
extents.  Much of the 48-bit code is dependent upon CONFIG_LBD or
sizeof(ext3_fsblk_t), so if people have no desire to use large (2TB+) or
larger (16TB+) filesystems these conditionals disappear at compile time.
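
As a rough illustration of the "one read helper, one write helper" pattern
mentioned above, here is a sketch that uses the extent fields from the cover
letter purely as an example.  The demo_* names and the has_48bit argument are
hypothetical and not taken from the patches, and the fields are kept in host
byte order for simplicity (on disk they are little-endian).

```c
#include <assert.h>
#include <stdint.h>

/* In-memory mirror of the extent fields from the cover letter (host order). */
struct demo_extent {
    uint32_t ee_block;    /* first logical block extent covers */
    uint16_t ee_len;      /* number of blocks covered by extent */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start;    /* low 32 bits of physical block */
};

/* Read helper: the single place that decides whether the high bits count. */
static uint64_t demo_ext_pblock(const struct demo_extent *ex, int has_48bit)
{
    uint64_t blk = ex->ee_start;

    if (has_48bit)
        blk |= (uint64_t)ex->ee_start_hi << 32;
    return blk;
}

/* Write helper: the matching single conditional on the store side. */
static void demo_ext_store_pblock(struct demo_extent *ex, uint64_t blk,
                                  int has_48bit)
{
    assert(blk < (1ULL << 48));             /* must fit in 48 bits */
    assert(has_48bit || blk <= UINT32_MAX); /* 32-bit mode: low word only */
    ex->ee_start = (uint32_t)blk;
    ex->ee_start_hi = has_48bit ? (uint16_t)(blk >> 32) : 0;
}
```

Every caller goes through these two helpers, so the number of call sites can
be large while the number of distinct conditionals stays at two.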

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 15:25       ` Jeff Garzik
  2006-06-09 15:40         ` Linus Torvalds
@ 2006-06-10 19:10         ` Kyle Moffett
  2006-06-10 19:27           ` Linus Torvalds
  1 sibling, 1 reply; 296+ messages in thread
From: Kyle Moffett @ 2006-06-10 19:10 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
	linux-fsdevel, Andreas Dilger

On Jun 9, 2006, at 11:25:31, Jeff Garzik wrote:
> Overall, I'm surprised that ext3 developers don't see any of the  
> problems related to progressive, stealth filesystem upgrades.
>
> Users are never given a clear indication of when their metadata is  
> being upgraded, there is no clear "line of demarcation" they cross,  
> when they start using extents.
>
> Since there is no user-visible fs upgrade event, users do not have  
> a clear picture of what features are being used -- which means they  
> are kept in the dark about which kernels are OK to use on their data.
>
> Do you guys honestly expect users to keep track of which kernels  
> added specific ext3 features?
>
> This is why other enterprise filesystems have clear "fs version 1",  
> "fs version 2" points across which a user migrates.  ext3's feature- 
> flags approach just means that there are a million combinations of  
> potential old-and-new features, in-tree and third party, all of  
> which must be supported.

One possible solution to the version-confusion that would avoid  
duplicating features would be to merge the fs/ext{2,3} to fs/ext,  
then make fs/ext register itself as a filesystem under "ext2",  
"ext3", and "ext4".  Then have each name imply a specific set of  
features and compatibility.  That would allow the same performance  
optimizations to affect all 3 even as you make metadata changes in  
the latest version.  I've heard quite some griping about the amount  
of duplicated code between ext2 and ext3; why cause those problems  
again with an "ext4"?  There would probably be some fs/ext/ext{2,3,4} 
_foo.c files that could be compiled in or out depending on configured  
FS support, but I would guess that would make it easier on users and  
developers alike.

Cheers,
Kyle Moffett


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:01                           ` Jeff Garzik
@ 2006-06-10 19:27                             ` Kyle Moffett
  2006-06-10 19:44                               ` Linus Torvalds
  0 siblings, 1 reply; 296+ messages in thread
From: Kyle Moffett @ 2006-06-10 19:27 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

On Jun 9, 2006, at 15:01:20, Jeff Garzik wrote:
> Chase Venters wrote:
>> Now, granted, I really do agree with you about the whole code  
>> sharing thing. A fresh start is often just what you need. I'm just  
>> questioning if it wouldn't be better to do this fresh start  
>> immediately after going 48-bit, rather than before. That way,  
>> existing users that want that extra umph can have it today.
>
> Then you continue to crap up the code with
>
> 	if (48bit)
> 		...
> 	else
> 		...
>
> etc.
>
> The proper way to do this is "cp -a ext3 ext4" (excluding JBD as  
> Andrew mentioned), and then let evolution take its course.

Why not: "extX_ops.do_something_useful();", then have fs/ext/ext 
{2,3,4}_ops.c which implement those various operations just like we  
do for the Virtual Filesystem Switch?  Much as there are  
commonalities between all filesystems that get moved into the VFS;  
perhaps we should have a Virtual Ext Filesystem Switch (VEFS?  
VextFS?) which abstracts out the commonalities between the evolving  
ext{2,3} code and data format?  Such code would also provide a  
library of common routines which could be used to implement other  
specialized filesystems in the future.  Imagine a cluster-extfs which  
reuses some of the core extXfs code despite changing the on-disk  
format considerably!
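
To make the proposal concrete, here is a small sketch of what such a
per-version operations table might look like.  Everything in it (the
demo_ext_ops structure, its fields, and the lookup function) is invented for
illustration and does not correspond to existing kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-version operations table, in the spirit of VFS ops. */
struct demo_ext_ops {
    const char *name;               /* filesystem name it registers under */
    unsigned int max_block_bits;    /* width of physical block numbers */
    int (*supports_extents)(void);  /* version-specific behaviour */
};

static int no_extents(void)   { return 0; }
static int with_extents(void) { return 1; }

/* One entry per name; shared code dispatches through the table. */
static const struct demo_ext_ops demo_ops_table[] = {
    { "ext2", 32, no_extents },
    { "ext3", 32, no_extents },
    { "ext4", 48, with_extents },
};

/* Look up the ops for a filesystem name; NULL if the name is unknown. */
static const struct demo_ext_ops *demo_find_ops(const char *fsname)
{
    size_t i;

    assert(fsname != NULL);
    for (i = 0; i < sizeof(demo_ops_table) / sizeof(demo_ops_table[0]); i++) {
        if (strcmp(demo_ops_table[i].name, fsname) == 0)
            return &demo_ops_table[i];
    }
    return NULL;
}
```

The point of the indirection is that common code calls through the table
instead of testing "if (48bit)" at each site.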

Cheers,
Kyle Moffett


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 19:10         ` Kyle Moffett
@ 2006-06-10 19:27           ` Linus Torvalds
  0 siblings, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-10 19:27 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
	linux-fsdevel, Andreas Dilger

On Sat, 10 Jun 2006, Kyle Moffett wrote:
> 
> One possible solution to the version-confusion that would avoid duplicating
> features would be to merge the fs/ext{2,3} to fs/ext, then make fs/ext
> register itself as a filesystem under "ext2", "ext3", and "ext4".

But the thing is, technical people don't actually care about the version 
confusion.

The real issue is that ext3 is a stable filesystem, and the ext4 stuff 
buys fundamentally and absolutely _nothing_ for the vast majority of uses. 
Except pain.

So the real reason for the split would be the _user_ split. There are 
people who want big filesystems, and there are people who don't care. 

It's that simple.

> I've heard quite some griping about the amount of duplicated code 
> between ext2 and ext3;

That's a total piece of bullshit. Nobody seriously gripes about the 
duplication, and the ones that do have absolutely no idea what that split 
bought us. Ignore them.

			Linus


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 19:27                             ` Kyle Moffett
@ 2006-06-10 19:44                               ` Linus Torvalds
  2006-06-10 20:02                                 ` [Ext2-devel] " Linus Torvalds
  0 siblings, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-10 19:44 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

On Sat, 10 Jun 2006, Kyle Moffett wrote:
> 
> Why not: "extX_ops.do_something_useful();", then have fs/ext/ext{2,3,4}_ops.c

I think that kind of setup is hugely preferable to conditionals in the 
code, if only because it tends to force people to do the abstractions 
right, and make the code sequences independent.

I just don't think it's necessarily very realistic - it's _hard_ to 
refactor code well. It also doesn't buy you hardly anything at all, since 
the people who are interested in ext2 are usually not very interested in 
sharing code with ext3. The filesystems simply aren't that similar, apart 
from the layout. 

ext2 is half the size of ext3, and that's ignoring JBD entirely.

That constant growth, btw, is one reason why splitting off legacy 
filesystems is often a good idea. What do you want to bet that the 2000+ 
line difference RIGHT NOW in ext3/ext4 will grow in the future? Splitting 
things off means that people who don't care about the new features can 
just stay with a stable base and also avoid the bloat. Exactly the way you 
can stay with ext2 on an old machine, and avoid the bloat of ext3.

There's also nothing that says that legacy filesystems cannot be 
simplified. For example, it's perfectly realistic to say that ext3 (as a 
legacy filesystem) doesn't support resizing, and simply ripping that part 
out of it. The people who don't want the bloat will be happy. The people 
who want the feature can move to ext4.

See? Splitting development is what allows you to make choices that you 
simply otherwise don't _have_. 

			Linus


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:07             ` Olivier Galibert
@ 2006-06-10 19:52               ` Theodore Tso
  0 siblings, 0 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-10 19:52 UTC (permalink / raw)
  To: Olivier Galibert, Sven-Haegar Koch, Michael Poole, Jeff Garzik,
	Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
	linux-fsdevel

On Sat, Jun 10, 2006 at 04:07:14PM +0200, Olivier Galibert wrote:
> > Incorrect, because unless you explicitly enable the use of extents,
> > the mere act of using a new kernel such as might be found on knoppix
> > will not result in the filesystem utilizing the extent feature.
> 
> And how shall the rescue/live CD know whether to use the feature?

Because there will be a bit in the superblock that the user will have to
explicitly enable in order to get extents, so a new kernel on the
rescue/live CD will know whether or not extents are allowed --- just as
today, you have to explicitly enable hashed tree directory indexing
with the command, tune2fs -O dir_index /dev/hdXXX.
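
A minimal sketch of that check, assuming the bit lives in an "incompatible
features" mask in the superblock.  The demo_* names are hypothetical and the
bit value is illustrative only, not a statement about the on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit value; an assumption, not the real on-disk ABI. */
#define DEMO_INCOMPAT_EXTENTS 0x0040u

/* Mount is allowed only if the kernel understands every incompat bit set. */
static int demo_can_mount(uint32_t sb_incompat, uint32_t kernel_supported)
{
    return (sb_incompat & ~kernel_supported) == 0;
}

/*
 * A kernel that supports extents still creates them only when the
 * administrator has already set the bit in the superblock.
 */
static int demo_may_use_extents(uint32_t sb_incompat,
                                uint32_t kernel_supported)
{
    assert(demo_can_mount(sb_incompat, kernel_supported));
    return (sb_incompat & DEMO_INCOMPAT_EXTENTS) != 0;
}
```

With the bit clear, a new rescue-CD kernel mounts the filesystem but leaves
the format alone; with the bit set, an old kernel refuses the mount instead
of misreading the data.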

							- Ted


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10 19:44                               ` Linus Torvalds
@ 2006-06-10 20:02                                 ` Linus Torvalds
  2006-06-10 21:26                                   ` Theodore Tso
  0 siblings, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-10 20:02 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Jeff Garzik, Chase Venters, Alex Tomas, Andreas Dilger,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On Sat, 10 Jun 2006, Linus Torvalds wrote:
> 
> ext2 is half the size of ext3, and that's ignoring JBD entirely.

Btw, let me say again that I'm fairly neutral on any particular individual 
feature (ie the 48-bit thing doesn't actually move me all that much in 
itself), but that from a maintenance standpoint, I think splitting off 
filesystems and drivers has been a _huge_ success.

Starting from scratch - even if you literally start from the same 
code-base - and allowing the old functionality to remain undisturbed is 
just a very nice model. Yeah, yeah, it has some diskspace cost (although 
at least from a git perspective, even that isn't really true), but we've 
seen both in drivers and in filesystems how splitting things up has been a 
great thing to do.

Sometimes it's a great thing just because five years later, it turns out 
that nobody even uses the legacy thing, and you decide to at that point 
just remove the driver (or filesystem, but so far it's never been the 
case for filesystems even if smbfs is a potential victim of this in the 
not _too_ distant future), because the new version simply does everything 
better.

And that's _not_ a failure of the model. It's a success too. But so is the 
above commentary on ext2, when the "old driver/filesystem is still used 
and maintained by odd people". It's just two different possible outcomes 
of the decision to do development separately from an older user base.

And again, I'd like to stress the _user_base_ over the _code_base_. In 
many ways, that's the much more important split. I suspect Jeff has seen 
this in drivers, where a lot of users simply do not want to have a new 
driver, because it does some huge fundamental improvement for new users 
but doesn't work for old ethernet cards, for example, because it missed 
some old use case that depended on a legacy feature that just doesn't fit well 
into the new (and obviously improved) world-view.

So we've often seen a driver that _could_ have handled different versions 
of the same card/chip split into an "old" and a "new" driver, and on the 
whole it has always been positive - even if eventually the old driver just 
becomes irrelevant for one reason or another.

Duplication isn't actually bad. It's what often allows experimentation, 
and streamlining. In drivers, for example, duplication is _often_ done as 
part of simply dropping support for old cards in the new version, but also 
by dropping and simplifying the old driver that now has a much clearer 
"raison d'etre", aka "user base".

Which gets me back to the whole "'user base' matters more than 'code 
base'" argument, because it's literally the user base that determines 
development (or lack of it - non-development is often the big reason for a 
user base, as anybody who works for a distribution maintainer should know 
intimately).

			Linus


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 20:02                                 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-10 21:26                                   ` Theodore Tso
  2006-06-10 21:31                                     ` Linus Torvalds
                                                       ` (2 more replies)
  0 siblings, 3 replies; 296+ messages in thread
From: Theodore Tso @ 2006-06-10 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
	Andreas Dilger

On Sat, Jun 10, 2006 at 01:02:26PM -0700, Linus Torvalds wrote:
> Starting from scratch - even if you literally start from the same 
> code-base - and allowing the old functionality to remain undisturbed is 
> just a very nice model. Yeah, yeah, it has some diskspace cost (although 
> at least from a git perspective, even that isn't really true), but we've 
> seen both in drivers and in filesystems how splitting things up has been a 
> great thing to do.
> 
> Sometimes it's a great thing just because five years later, it turns out 
> that nobody even uses the legacy thing, and you decide to at that point 
> just remove the driver (or filesystem, but so far it's never been the 
> case for filesystems even if smbfs is a potential victim of this in the 
> not _too_ distant future), because the new version simply does everything 
> better.

	So you would be OK with a model where we copy fs/ext3 to
"fs/ext4", and do development there which would be merged rapidly into
mainline so that people who want to participate in testing can use
ext3dev, while people who want stability can use ext3 --- and at some
point, we remove the old ext3 entirely and let fs/ext4 register itself
as both the ext3 and ext4 filesystem, and at some point in the future,
remove the ext3 name entirely?

	If that allows us to make forward progress and stop the
flamewar, I'm willing to go along with it --- although e2fsprogs will
continue to support ext2/3/4, and ext4 will have backwards
compatibility support for ext3 formats (we can look at better ways of
refactoring code to make it cleaner, if people don't like the current
conditions).  There are some real advantages to the system, especially
if we can get changes merged into mainline for ext4 more quickly while
it is under development and declared to be unstable (we can put it
under CONFIG_EXPERIMENTAL if people really want).  

	As far as people who want to use ext3 as the beginning point
to do something that has no forward compatibility, there's
nothing stopping them from creating a jgarzikfs if they want.  But I
think I can speak for most of the ext3 development community that we
feel that one of the strengths of ext2/3 is its ability to do smooth
upgrades (and in many cases, downgrades as well, when people need to
migrate a filesystem so it can be mounted on older kernels), and that
it's one of the reasons why ext3 has been more successful than, say,
JFS. 

	I do think there is plenty of room for competition, and I'm
certainly looking forward to the brainstorming at next week's
filesystem workshop.  But ext2/3 has been pretty successful for over
ten years given a certain development model and philosophy, and I for
one am interested how much farther we can take it.  Remember when
academics were saying that Linux was an obsolete design and
microkernels were where it's at?  If we had given up 15 years ago when
Prof. Tanenbaum had said it, where would we be?

						- Ted

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 21:26                                   ` Theodore Tso
@ 2006-06-10 21:31                                     ` Linus Torvalds
  2006-06-10 22:12                                     ` Jeff Garzik
  2006-06-10 22:21                                     ` Jeff Garzik
  2 siblings, 0 replies; 296+ messages in thread
From: Linus Torvalds @ 2006-06-10 21:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Chase Venters, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
	Andreas Dilger

On Sat, 10 Jun 2006, Theodore Tso wrote:
>
> 	So you would be OK with a model where we copy fs/ext3 to
> "fs/ext4", and do development there which would be merged rapidly into
> mainline so that people who want to participate in testing can use
> ext3dev, while people who want stability can use ext3

Absolutely.

> --- and at some
> point, we remove the old ext3 entirely and let fs/ext4 register itself
> as both the ext3 and ext4 filesystem, and at some point in the future,
> remove the ext3 name entirely?

Maybe, and maybe not. That depends on where ext4 is when the thing calms 
down.

Look at what happened to ext2. Would you seriously suggest removing it 
just because ext3 does more than ext2 does?

And yes, if I recall correctly, all the same ext2 people were against the 
whole ext2->ext3 split also, which we did for the same reason - I and 
others refused to let people "hack on" the standard stable filesystem.

Yet I don't see anybody in this discussion saying "I admit I was wrong 
back then - the split was correct". Hmm. I wonder where those people went?

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 21:26                                   ` Theodore Tso
  2006-06-10 21:31                                     ` Linus Torvalds
@ 2006-06-10 22:12                                     ` Jeff Garzik
  2006-06-10 22:21                                     ` Jeff Garzik
  2 siblings, 0 replies; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 22:12 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
	Linus Torvalds, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
	Andreas Dilger

Theodore Tso wrote:
> 	As far as people who want to use ext3 as the beginning point
> to do something that has no forwards compatibility, there's
> nothing stopping them from creating a jgarzikfs if they want.  But I
> think I can speak for most of the ext3 development community that we
> feel that one of the strengths of ext2/3 is its ability to do smooth
> upgrades (and in many cases, downgrades as well, when people need to
> migrate a filesystem so it can be mounted on older kernels), and that
> it's one of the reasons why ext3 has been more successful than, say,
> JFS. 

When did I ever say smooth upgrades were a bad idea?

The whole point of 'cp -a ext3 ext4' is to ensure smooth upgrades 
continue.  A key theme is to avoid -backporting- all this new stuff 
that's going into ext4.  IMO ext3 shouldn't be a devel platform at this 
point in its lifecycle.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 21:26                                   ` Theodore Tso
  2006-06-10 21:31                                     ` Linus Torvalds
  2006-06-10 22:12                                     ` Jeff Garzik
@ 2006-06-10 22:21                                     ` Jeff Garzik
  2006-06-11  4:39                                       ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
  2 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-10 22:21 UTC (permalink / raw)
  To: Theodore Tso, Linus Torvalds, Kyle Moffett, Jeff Garzik,
	Chase Venters, Alex Tomas, Andreas Dilger, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

Theodore Tso wrote:
> 	So you would be OK with a model where we copy fs/ext3 to
> "fs/ext4", and do development there which would be merged rapidly into
> mainline so that people who want to participate in testing can use
> ext3dev, while people who want stability can use ext3 --- and at some
> point, we remove the old ext3 entirely and let fs/ext4 register itself
> as both the ext3 and ext4 filesystem, and at some point in the future,
> remove the ext3 name entirely?

Yep, and in addition I would argue that you can take the opportunity to 
make ext4 default to extents-enabled, and some similar behavior changes 
(dir_index default?).  The existence of both ext3 and ext4 means you can 
be more aggressive in turning on stuff, IMO.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 14:42                 ` Ingo Molnar
                                     ` (2 preceding siblings ...)
  2006-06-10 16:05                   ` Christoph Hellwig
@ 2006-06-10 23:05                   ` Mike Galbraith
  3 siblings, 0 replies; 296+ messages in thread
From: Mike Galbraith @ 2006-06-10 23:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, Adrian Bunk,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger

On Sat, 2006-06-10 at 16:42 +0200, Ingo Molnar wrote:
> frankly, i'll leave that decision to the ext3 developers and obviously, 
> to distributors. Their filesystem has handled my data for 10 years, and 
> they have been very conservative about their technical choices 
> throughout. I trust them to not mess up this time either.

That's my view in a nutshell (minus distributors).  Add caps.

	-Mike 

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Stable/devel policy - was Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-10 22:21                                     ` Jeff Garzik
@ 2006-06-11  4:39                                       ` Neil Brown
  2006-06-11  5:19                                         ` Stable/devel policy - was " Linus Torvalds
  2006-06-13  0:28                                         ` Stable/devel policy - was Re: [Ext2-devel] " Mingming Cao
  0 siblings, 2 replies; 296+ messages in thread
From: Neil Brown @ 2006-06-11  4:39 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Theodore Tso, Linus Torvalds, Kyle Moffett, Chase Venters,
	Alex Tomas, Andreas Dilger, Andrew Morton, ext2-devel,
	linux-kernel, cmm, linux-fsdevel

On Saturday June 10, jeff@garzik.org wrote:
> Theodore Tso wrote:
> > 	So you would be OK with a model where we copy fs/ext3 to
> > "fs/ext4", and do development there which would be merged rapidly into
> > mainline so that people who want to participate in testing can use
> > ext3dev, while people who want stability can use ext3 --- and at some
> > point, we remove the old ext3 entirely and let fs/ext4 register itself
> > as both the ext3 and ext4 filesystem, and at some point in the future,
> > remove the ext3 name entirely?
> 
> Yep, and in addition I would argue that you can take the opportunity to 
> make ext4 default to extents-enabled, and some similar behavior changes 
> (dir_index default?).  The existence of both ext3 and ext4 means you can 
> be more aggressive in turning on stuff, IMO.
> 
> 	Jeff

I'm wondering what all this has to say about general principles of
sub-project development with the Linux kernel.

There is a strong tradition of software projects having a 'stable'
branch and a 'development' branch, and having both available and both
receiving bug fixes (at least) so that users can choose what best
suits their needs.

Due to the (quite appropriate) lack of a stable API for kernel
modules, it isn't really practical (and definitely isn't encouraged)
to distribute kernel-modules separately.  This seems to suggest that
if we want a 'stable' and a 'devel' branch of a project, both branches
need to be distributed as part of the same kernel tree.

Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
evidence of this happening.  Why is that?

 - is -mm enough?  It seems to be enough for small updates, but
   doesn't seem to be enough for more major projects.  How long
   have the ext3 patches been in -mm?? (I cannot actually seem
   to find them there at all)

 - is there lots of -devel code slipping in to the 'stable' tree, thus
   resulting in a kernel.org tree that is permanently unstable (in
   which case there should be no objection to the new ext3 code -
   leave it to distros to keep it out until it is stable).

 - are we just not innovating as much as we could be and so don't
   need a -devel? Is ext3 the only site of major innovation?
   Seems unlikely.

It seems a bit rough to insist that the ext-fs fork every so often,
but not impose similar requirements on other sections of code.

So: what would you (collectively) suggest should be the policy for
managing substantial innovation within Linux subsystems?  And how
broadly should it be applied?

NeilBrown

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: Stable/devel policy - was Re: [RFC 0/13] extents and 48bit ext3
  2006-06-11  4:39                                       ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
@ 2006-06-11  5:19                                         ` Linus Torvalds
  2006-06-11  7:32                                           ` Ingo Molnar
  2006-06-13  0:28                                         ` Stable/devel policy - was Re: [Ext2-devel] " Mingming Cao
  1 sibling, 1 reply; 296+ messages in thread
From: Linus Torvalds @ 2006-06-11  5:19 UTC (permalink / raw)
  To: Neil Brown
  Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
	linux-kernel, Chase Venters, cmm, linux-fsdevel, Kyle Moffett,
	Alex Tomas, Andreas Dilger

On Sun, 11 Jun 2006, Neil Brown wrote:
> 
> I'm wondering what all this has to say about general principles of
> sub-project development with the Linux kernel.

Yes. That's an interesting and relevant tangent.

> Due to the (quite appropriate) lack of a stable API for kernel
> modules, it isn't really practical (and definitely isn't encouraged)
> to distribute kernel-modules separately.  This seems to suggest that
> if we want a 'stable' and a 'devel' branch of a project, both branches
> need to be distributed as part of the same kernel tree.
> 
> Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
> evidence of this happening.  Why is that?

I think part of it is "expense". It's pretty expensive to maintain on a 
bigger scale. For example, you mention "-mm", and there's no question that 
it's _very_ expensive to do that (ie you basically need a very respected 
person who must be spending a fair amount of effort and time on it).

Even in this case, I think a large argument has been that ext3 itself 
isn't getting a lot of active development outside of the suggested ext4 
effort, so the "expense" there is literally just the copying of the files. 
That works ok for a filesystem every once in a while, but it wouldn't 
scale to _everybody_ doing it often. 

Also, in order for it to work at all, it obviously needs to be a part of 
the kernel that -can- be duplicated. That pretty much means "filesystem" 
or "device driver". Other parts aren't as amenable to having multiple 
concurrent versions going on at the same time (although it clearly does 
happen: look at the IO schedulers, where a large reason for the pluggable 
IO scheduler model was to allow multiple independent schedulers exactly so 
that people _could_ do different ones in parallel).

People have obviously suggested pluggable CPU schedulers too, and even 
more radically pluggable VM modules (not that long ago).

> It seems a bit rough to insist that the ext-fs fork every so often,
> but not impose similar requirements on other sections of code.

Well, as mentioned, it's actually quite common in drivers. It's clearly 
not the _main_ development model, but it's happened several times in 
almost every single driver subsystem (ie SCSI drivers, video drivers, 
network drivers, USB, IDE, have _all_ seen "duplicated" drivers where 
somebody just decided to do things differently, and rather than extend an 
existing driver, do an alternate one).

So it's not like this is _exceptional_. It happens all the time. It 
obviously happens less than normal development (we couldn't fork things 
every time something changes), but it's not unheard of, or even rare.

> So: what would you (collectively) suggest should be the policy for
> managing substantial innovation within Linux subsystems?  And how
> broadly should it be applied?

I think the interesting point is how we're moving away from the "global 
development" model (ie everything breaks at the same time between 2.4.x 
and 2.6.x), and how the fact that we're trying to maintain a more stable 
situation may well mean that we'll see more of the "local development" 
model where a specific subsystem goes through a development series, but 
where stability requirements mean that we must not allow it to disturb 
existing users.

And even more interestingly (at least to me), the question might become 
one of "how does that affect the tools and build and configuration 
infrastructure", and just the general flow of development.

I don't think one or two filesystems (and a few drivers) splitting is 
anything new, but if this ends up becoming _more_ common, maybe that 
implies a new model entirely..

		Linus

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-10 15:03                   ` Jeff Garzik
@ 2006-06-11  6:00                     ` Ingo Molnar
  0 siblings, 0 replies; 296+ messages in thread
From: Ingo Molnar @ 2006-06-11  6:00 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, ext2-devel, linux-kernel, Adrian Bunk,
	Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
	Andreas Dilger


* Jeff Garzik <jeff@garzik.org> wrote:

> I agree with your point in the thread -- most users and distros don't 
> change their main fs on a whim.  But I also point out that these 
> extent+48bit changes _do_ change the format in an incompatible way...

yeah. /me learns to not post too much while watching football ;)

	Ingo

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: Stable/devel policy - was Re: [RFC 0/13] extents and 48bit ext3
  2006-06-11  5:19                                         ` Stable/devel policy - was " Linus Torvalds
@ 2006-06-11  7:32                                           ` Ingo Molnar
  0 siblings, 0 replies; 296+ messages in thread
From: Ingo Molnar @ 2006-06-11  7:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Theodore Tso, Jeff Garzik, Neil Brown, ext2-devel,
	linux-kernel, Chase Venters, cmm, linux-fsdevel, Kyle Moffett,
	Alex Tomas, Andreas Dilger

* Linus Torvalds <torvalds@osdl.org> wrote:

> And even more interestingly (at least to me), the question might 
> become one of "how does that affect the tools and build and 
> configuration infrastructure", and just the general flow of 
> development.
> 
> I don't think one or two filesystems (and a few drivers) splitting is 
> anything new, but if this ends up becoming _more_ common, maybe that 
> implies a new model entirely..

at least for core kernel stuff, it's hard to split things in any 
manageable way (as you mentioned it as well) - so higher flux is 
inevitable.

So what i've been focusing on more in the past year or so is to enable 
the core kernel to take more development flux, via kernel features.

Instead of adding more features to the kernel, i'm quite interested in 
seeing more technologies that make a higher development flux safer: to
make the kernel more debuggable, to make bugs more reportable for users,
to make the effects of bugs less harmful, and to make the kernel itself
notice more bugs by itself.

To be able to handle a higher development flux in core code, i think we 
need the following policies wrt. core kernel changes:

 - More code consolidation between architectures and subsystems.

   Core kernel changes impact "non-mainstream" architectures the most - 
   while some of our best technologies root from non-mainstream 
   technologies. So it's a net loss to only concentrate on the 
   mainstream, because developer and technology distribution does not 
   follow user distribution.

   The generic irq subsystem, spinlock and semaphore/mutex consolidation 
   are all efforts in this direction. I consider the Generic Time Of Day 
   (GTOD) effort a similarly important item, for the same reasons. There 
   are other good examples too, for example klibc is a good step towards 
   a more consolidated boot process. The Xen subarch work triggers 
   consolidation too - etc. Andrew's policy of "you must not break _any_ 
   architecture in -mm" is very important too.

   And we should do consolidation even in cases where there's some
   minimal runtime cost. Being able to handle higher flux is more 
   important than getting the last cycle out of the system. This does
   not mean we should reject patches that do get those last cycles, this 
   only means we should not reject consolidation patches on the grounds 
   that they _lose_ a few cycles. I don't think this is a common problem 
   for consolidation projects right now - but it could happen in the 
   future.

 - Even more cleanups.

   We always preferred cleanups but it now becomes critical: i strongly 
   believe that cleanups must take precedence over feature work. [with a 
   few rare and temporary exceptions perhaps, like hardware-enablement 
   or really critical features.] It's much easier to spot bugs in clean 
   code, plus it's much easier for automated correctness validators to 
   find bugs in clean code.

   (My own examples here include spinlock-init cleanups, which directly
   enabled things like the lock validator. But pure code cleanups apply 
   too. )

 - More automated correctness-checking tools and kernel features.

   While the preferred mode of avoiding bugs should be a clean 
   design and clean code, higher flux introduces higher noise and bugs 
   are inevitable. So the importance of automated tools (both static and 
   dynamic analysis) increases.

   Sparse annotations are one good example. My own examples here are the
   lock validator, the mutex debugging code, the consolidated
   spinlock debugging code. Some of these are direct feature-enablers: 
   for example the smp_processor_id() debugging code directly enabled a 
   safe and painless migration to PREEMPT_BKL. One nice feature in the 
   works that can find hard-to-spot bugs is kmemleak.

 - Coding style police!

   With higher development flux it is becoming even more important for 
   kernel developers to review other developers' work. But that is very 
   hard if the coding style varies too much. This is a fundamentally 
   human problem, and the only sane solution is brutal: the _strict_ 
   Linus coding style must be used in all high-flux subsystems.

 - More debuggability, reportability.

   In this area we still suck quite a bit, and this affects userspace
   too: currently we have nothing equivalent to things like Dr Watson,
   in Linux most of the info about the first userspace crash almost 
   always gets lost! (and even afterwards, once debug packages are 
   downloaded and the app is run in gdb, it's still too painful for the 
   user, so we lose lots of feedback.)

   Some of the GUIs try to do something about this and automate crash 
   reporting, but it doesn't cover most of the app crashes and userspace 
   clearly needs kernel help, because ptrace is too inflexible for this 
   purpose. (help is on the way though, there's a next-gen ptrace 
   project that solves these problems very cleanly.)

   There are a number of important projects going on in this area - for 
   example the dwarf unwinder for x86_64 to improve the quality of 
   kernel oopses, and kgdb (or bits of NLKD) if it gets clean enough.

my own impression is that things are going in the right direction, but
that there should be more awareness of these principles. I think if we
add a couple more key technologies then we can take the higher kernel
development flux just fine, without compromising quality. Even though
Linux has lots of developers, we should be more economical with that
development power and should waste less of that on unnecessarily complex
debugging tasks.

I do consider the forking of a subsystem the "easy way out" - the hard 
and more correct approach is i think to turn every drastic rewrite into 
small manageable steps. That's much easier said than done, and it's 
sometimes 10 times the work but it's a lot safer - and the end result is 
often wildly different (and a lot cleaner!) from what one would do via a 
drastic rewrite. A dumb 'cp -a' copying of a subsystem will preserve 
most of the legacies and architectural inefficiencies. Even an 
intelligent drastic rewrite preserves most of the legacies - there's 
just so much change users can take at once, and _eventually_ a new 
subsystem has to be exposed to real users - at which point the 
compatibility constraints apply again. I have yet to see a single case 
of hard physical necessity to throw away an old subsystem due to 
legacies. I think the prime example to follow is how Al Viro works - 
he's been maintaining the VFS for many years without having to 
duplicate functionality, without breaking the world, but he still 
managed to turn the VFS upside down, inside out, in small, manageable 
steps. It _is_ possible in almost every case, for all but the most 
spaghetti pieces of code.

	Ingo

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
@ 2006-06-11  8:22 linux
  0 siblings, 0 replies; 296+ messages in thread
From: linux @ 2006-06-11  8:22 UTC (permalink / raw)
  To: akpm, linux-fsdevel, linux-kernel

> We seem to be lagging behind "the industry" in some areas - handling large
> devices, high bandwidth IO, sophisticated on-disk data structures, advanced
> manageability, etc.

Er... I would like to point out that "sophisticated on-disk data
structures" are, in and of themselves, a Bad Thing.  It's only when
they provide some desirable capability that they earn their cost in
implementation difficulty, code size, and bug rate.

ZFS is interesting, and I Really Really Like its reliability guarantees,
but I notice that, due to the append-only nature of its operation,
it's extraordinarily difficult to move data once it's been written.
This makes migrating a file system off of old nasty disks to big new
disks rather annoying.  If you know before you add the new drives, you
can physically mirror the old disks and avoid changing block pointers,
but I'd wish for something more flexible.

Because block pointers are physical, and all checksummed, moving a
single block requires rewriting the root block of every snapshot that
contains that block.  Now, you can keep an index of "old block X is now
in new location Y" while walking the entire file system until you're
sure that all the old pointers are gone, but it's hard to preallocate
that index, because you also have to know that "old pointer block X
has been recreated at new location Y, but its contents are different;
only the logical content is the same", and there's no obvious way to
bound the number of such forwarding notes that need to be made.

You must have such an index, or you can't preserve sharing while you
migrate the data.

H'm... for sane efficiency, you also need to keep track of all metadata
blocks that have been examined and NOT changed, so when you hit them again
traversing the file system structure DAG, you know that you can stop.
Between the two, this amounts to every metadata block on the file system.
Wow!

Well, at least that gives you an upper limit on the size needed.
One block forwarding entry per data block on the migrated-from disk,
plus one index-forwarding entry (which may be larger, if it contains
the new block checksum) for each index block on the entire file system.

Ouch.

(And, of course, all of this has to be done on a live file system.)
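
As a rough illustration of the upper bound above, here is a minimal C sketch of the arithmetic: one forwarding entry per data block on the migrated-from disk, plus one (larger) entry per index block on the whole file system. The entry sizes and the function name are assumptions made up for this example; they are not taken from ZFS or any real on-disk format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry sizes, chosen only for illustration. */
#define DATA_FWD_BYTES  16u  /* old block address -> new block address */
#define INDEX_FWD_BYTES 48u  /* the same, plus room for a 32-byte checksum */

/*
 * Upper bound, in bytes, on the forwarding index needed for a migration:
 * one entry per data block on the old disk, and one index-forwarding
 * entry per index (metadata) block on the entire file system.
 */
uint64_t forwarding_table_bytes(uint64_t old_disk_data_blocks,
                                uint64_t fs_index_blocks)
{
    /* Keep each term at or below half of UINT64_MAX so the sum cannot wrap. */
    assert(old_disk_data_blocks <= UINT64_MAX / (2 * DATA_FWD_BYTES));
    assert(fs_index_blocks <= UINT64_MAX / (2 * INDEX_FWD_BYTES));

    return old_disk_data_blocks * DATA_FWD_BYTES +
           fs_index_blocks * INDEX_FWD_BYTES;
}
```

With these assumed sizes, a million data blocks on the old disk plus ten thousand index blocks come to 16,000,000 + 480,000 = 16,480,000 bytes of forwarding notes.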

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:44                       ` Alan Cox
@ 2006-06-11 15:52                         ` Arjan van de Ven
  0 siblings, 0 replies; 296+ messages in thread
From: Arjan van de Ven @ 2006-06-11 15:52 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, Jeff Garzik, adilger, torvalds, alex, ext2-devel,
	linux-kernel, cmm, linux-fsdevel

On Fri, 2006-06-09 at 21:44 +0100, Alan Cox wrote:
> OTOH the number of complaints about this is minimal, people want to go
> forwards in a controlled manner not backwards.

well... they want to be able to go "a little bit" backwards; say one
version of an OS (6 months). Eg the scenario that ought to work is "go
to newer version, hate it, go back". But yes that's a limited time to go
back, not the "go back to 2.2" kind of "go back".


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:51                       ` Jeff Garzik
  2006-06-09 19:39                         ` Gerrit Huizenga
  2006-06-09 19:49                         ` [Ext2-devel] " Theodore Tso
@ 2006-06-11 16:02                         ` Arjan van de Ven
  2006-06-11 16:30                           ` Nikita Danilov
  2006-06-12  6:35                           ` Andreas Dilger
  2006-06-12 22:06                         ` [Ext2-devel] " Pavel Machek
  3 siblings, 2 replies; 296+ messages in thread
From: Arjan van de Ven @ 2006-06-11 16:02 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> PRECISELY.  So you should stop modifying a filesystem whose design is 
> admittedly _not_ modern!
> 
> ext3 is already essentially xiafs-on-life-support, when you consider 
> today's large storage systems and today's filesystem technology.  Just 
> look at the ugly hacks needed to support expanding an ext3 filesystem 
> online.


actually I think I disagree with you. One thing I've noticed over the
years is that ext2 layout has one thing going for it: it is simple and
robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
direct/indirect block based" may be better. It seems that once you go
into tree space (and I would call htree a borderline thing there) you
get both really complex code and fragile behavior all over (mostly in
terms of "when something goes wrong")


^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-11 16:02                         ` Arjan van de Ven
@ 2006-06-11 16:30                           ` Nikita Danilov
  2006-06-11 16:55                             ` [Ext2-devel] " Arjan van de Ven
  2006-06-12  6:35                           ` Andreas Dilger
  1 sibling, 1 reply; 296+ messages in thread
From: Nikita Danilov @ 2006-06-11 16:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

Arjan van de Ven writes:
 > On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
 > > PRECISELY.  So you should stop modifying a filesystem whose design is 
 > > admittedly _not_ modern!
 > > 
 > > ext3 is already essentially xiafs-on-life-support, when you consider 
 > > today's large storage systems and today's filesystem technology.  Just 
 > > look at the ugly hacks needed to support expanding an ext3 filesystem 
 > > online.
 > 
 > 
 > actually I think I disagree with you. One thing I've noticed over the
 > years is that ext2 layout has one thing going for it: it is simple and
 > robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
 > direct/indirect block based" may be better. It seems that once you go
 > into tree space (and I would call htree a borderline thing there) you
 > get both really complex code and fragile behavior all over (mostly in
 > terms of "when something goes wrong")

Huh? Direct/indirect/double-indirect/... _is_ a tree, albeit not a
balanced one. What makes s5fs/ffs/ufs/ext* so exceptionally robust is
the fixed position of inode tables, which provides a guaranteed starting
point for fsck under almost any circumstances.
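
To make the "it is a tree" point concrete, here is a minimal sketch of how the depth of that tree depends on the logical block number. It assumes the classic ext2-style layout of 12 direct pointers followed by one single, one double and one triple indirect pointer, with 4-byte block pointers; the function name is made up for the example and this is not the kernel's own lookup code.

```c
#include <assert.h>
#include <stdint.h>

/* Direct block pointers held in the inode itself (classic ext2 layout). */
#define NDIR_BLOCKS 12

/*
 * Return how many levels of indirection are needed to reach logical
 * block `lblk`: 0 = direct, 1 = single indirect, 2 = double indirect,
 * 3 = triple indirect, -1 = beyond what triple indirection can address.
 * `block_size` is in bytes; each on-disk block pointer is 4 bytes.
 */
int indirection_depth(uint64_t lblk, uint32_t block_size)
{
    /* Block size must be a power of two and at least 1 KiB. */
    assert(block_size >= 1024 && (block_size & (block_size - 1)) == 0);

    uint64_t ptrs = block_size / 4; /* pointers per indirect block */
    uint64_t span = 1;              /* blocks reachable at this depth */

    if (lblk < NDIR_BLOCKS)
        return 0;
    lblk -= NDIR_BLOCKS;

    for (int depth = 1; depth <= 3; depth++) {
        span *= ptrs;
        if (lblk < span)
            return depth;
        lblk -= span;
    }
    return -1;
}
```

The tree is unbalanced in the sense that a block's depth is fixed by its offset in the file: with 4k blocks, the first 12 blocks are reached directly, the next 1024 through one indirect block, and so on.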

Nikita.

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-11 16:30                           ` Nikita Danilov
@ 2006-06-11 16:55                             ` Arjan van de Ven
  0 siblings, 0 replies; 296+ messages in thread
From: Arjan van de Ven @ 2006-06-11 16:55 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Sun, 2006-06-11 at 20:30 +0400, Nikita Danilov wrote:
> Arjan van de Ven writes:
>  > On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
>  > > PRECISELY.  So you should stop modifying a filesystem whose design is 
>  > > admittedly _not_ modern!
>  > > 
>  > > ext3 is already essentially xiafs-on-life-support, when you consider 
>  > > today's large storage systems and today's filesystem technology.  Just 
>  > > look at the ugly hacks needed to support expanding an ext3 filesystem 
>  > > online.
>  > 
>  > 
>  > actually I think I disagree with you. One thing I've noticed over the
>  > years is that ext2 layout has one thing going for it: it is simple and
>  > robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
>  > direct/indirect block based" may be better. It seems that once you go
>  > into tree space (and I would call htree a borderline thing there) you
>  > get both really complex code and fragile behavior all over (mostly in
>  > terms of "when something goes wrong")
> 
> Huh? Direct/indirect/double-indirect/... _is_ a tree, albeit not a
> balanced one.

ok sure; the main strength is that it is not a dynamic tree.



^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:35                                 ` Alex Tomas
  2006-06-09 19:35                                   ` [Ext2-devel] " Jeff Garzik
  2006-06-09 20:44                                   ` Joel Becker
@ 2006-06-11 20:14                                   ` grundig
  2006-06-14 16:45                                     ` Alex Tomas
  2 siblings, 1 reply; 296+ messages in thread
From: grundig @ 2006-06-11 20:14 UTC (permalink / raw)
  To: Alex Tomas
  Cc: jeff, alex, alan, chase.venters, torvalds, adilger, akpm,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

On Fri, 09 Jun 2006 23:35:43 +0400,
Alex Tomas <alex@clusterfs.com> wrote:

> >>>>> Jeff Garzik (JG) writes:
> 
>  JG> Irrelevant.  That's a development-only situation.  It will be enabled
>  JG> by default eventually, and should be considered in that light.
> 
> that's your point of view. mine is that this option (and code)
> is to be used only when needed. 

Distros may ignore your opinion and may enable it, and users won't know
that it's enabled or even that such a feature exists - until they try to run
an older kernel. If almost nobody needs this feature, why not avoid
problems by not merging it and maintaining it separately from the
main tree?

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-11 16:02                         ` Arjan van de Ven
  2006-06-11 16:30                           ` Nikita Danilov
@ 2006-06-12  6:35                           ` Andreas Dilger
  1 sibling, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-12  6:35 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

On Jun 11, 2006  18:00 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 21:44 +0100, Alan Cox wrote:
> > OTOH the number of complaints about this is minimal, people want to go
> > forwards in a controlled manner not backwards.
> 
> well... they want to be able to go "a little bit" backwards; say one
> version of an OS (6 months). Eg the scenario that ought to work is "go
> to newer version, hate it, go back". But yes that's a limited time to go
> back, not the "go back to 2.2" kind of "go back".

Interestingly, one of the reasons we want(ed) to get the extents code into
the ext3 mainline ASAP is that this would allow it to be available for the
"go back" phase when (in a couple of years) you NEED to have support for
gigantic block devices and have no choice but use this code to update.
For today it would only be used by people who really want to use it.

On Jun 11, 2006  18:02 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> > PRECISELY.  So you should stop modifying a filesystem whose design is 
> > admittedly _not_ modern!
> > 
> > ext3 is already essentially xiafs-on-life-support, when you consider 
> > today's large storage systems and today's filesystem technology.  Just 
> > look at the ugly hacks needed to support expanding an ext3 filesystem 
> > online.
> 
> actually I think I disagree with you. One thing I've noticed over the
> years is that ext2 layout has one thing going for it: it is simple and
> robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
> direct/indirect block based" may be better. It seems that once you go
> into tree space (and I would call htree a borderline thing there) you
> get both really complex code and fragile behavior all over (mostly in
> terms of "when something goes wrong")

You're correct in calling htree a borderline case, because the directory
metadata is still accessible in a "linear" manner if the tree is corrupted
for some reason.  I've recently been thinking of making the structure even
more robust by encoding a singly- or doubly-linked list into the directory
leaf blocks.

However, the direct/indirect block tree is the most fragile part of
ext2/ext3.  It also has the bad effect that corruption in the file indirect
tree can easily amplify into widespread filesystem corruption, because wrongly
freeing an indirect block and reallocating it will potentially cause 1024 more
blocks to be freed when that indirect block is unlinked, etc.  This is also
the slowest part of e2fsck checking if it detects corruption (duplication)
in the block allocation.

When we had very small filesystems it was easy to tell if an
indirect block was corrupt, because the valid block numbers made up only
a small fraction of the 2^32 possible block numbers.  However, with large
filesystems valid block numbers make up a large fraction of the 2^32 block
number space.  As we get to 16TB filesystems it is impossible to tell when
an indirect block is filled with garbage and when it is valid.
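
A rough sketch of the arithmetic behind the two paragraphs above, assuming
4 KiB blocks and 32-bit block numbers.  The function names are made up for
illustration and do not come from the patches or from e2fsprogs:

```c
#include <assert.h>
#include <stdint.h>

/* Block pointers held by one indirect block: each pointer is a 32-bit
 * block number, so a 4096-byte block holds 1024 of them.  That is the
 * "1024 more blocks" that can be freed by mistake. */
static uint32_t sketch_ptrs_per_indirect(uint32_t block_size)
{
	assert(block_size % sizeof(uint32_t) == 0);
	return block_size / (uint32_t)sizeof(uint32_t);
}

/* Percentage of the 2^32 possible block numbers that fall inside a
 * filesystem of fs_blocks blocks, i.e. how often a garbage 32-bit
 * value would still look like a valid block number. */
static unsigned sketch_plausible_percent(uint64_t fs_blocks)
{
	const uint64_t space = (uint64_t)1 << 32;

	if (fs_blocks > space)
		fs_blocks = space;
	return (unsigned)(fs_blocks * 100 / space);
}
```

With 4 KiB blocks a 1 GB filesystem has 2^18 blocks, so far fewer than 1%
of random values look valid; at 8 TB (2^31 blocks) half of them do, and at
16 TB (2^32 blocks) every value does, which is the point where a range
check alone can no longer catch a garbage indirect block.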

One of the features of the extent format is that firstly it has a magic
number in each "indirect" block (called an extent index block).  Secondly,
there is enough redundancy that it allows internal validation of the extent
data (e.g. that extents are sequentially increasing logical offsets, and that
the parent's logical offset is correctly "encompassing" all of the leaf's
logical offsets).

Finally, one of the features that has been designed into the extent format
(though not yet implemented) is that it is possible to add a checksum to
each extent index to verify the metadata more strongly.  There will also
be space to have a back-pointer to the parent inode for validation.
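
The internal checks described above (magic number, increasing logical
offsets, leaf contained in the parent's range) could look roughly like the
sketch below.  It is only an illustration: the header layout, the magic
value, and every sketch_* name are assumptions, the structures are kept in
host byte order rather than the on-disk little-endian format, and only the
four extent fields come from the ext3_extent structure in the patch
description:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXT_MAGIC 0xF30A		/* assumed magic value */

struct sketch_extent_header {		/* hypothetical, simplified header */
	uint16_t eh_magic;		/* identifies an extent block */
	uint16_t eh_entries;		/* number of extents that follow */
};

struct sketch_extent {			/* same fields as ext3_extent */
	uint32_t ee_block;		/* first logical block covered */
	uint16_t ee_len;		/* number of blocks covered */
	uint16_t ee_start_hi;		/* high 16 bits of physical block */
	uint32_t ee_start;		/* low 32 bits of physical block */
};

/* Combine the two halves into the 48-bit physical block number. */
static uint64_t sketch_ext_pblock(const struct sketch_extent *ex)
{
	return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/* Return 1 if the leaf looks sane, 0 otherwise.  [first, last] is the
 * logical block range the parent index entry claims for this leaf. */
static int sketch_leaf_ok(const struct sketch_extent_header *hdr,
			  const struct sketch_extent *ex,
			  uint32_t first, uint32_t last)
{
	uint64_t next = first;	/* lowest block the next extent may start at */
	size_t i;

	assert(first <= last);
	if (hdr->eh_magic != SKETCH_EXT_MAGIC)
		return 0;			/* not an extent block at all */
	for (i = 0; i < hdr->eh_entries; i++) {
		uint64_t start = ex[i].ee_block;
		uint64_t end = start + ex[i].ee_len;	/* one past the end */

		if (ex[i].ee_len == 0)
			return 0;		/* empty extent */
		if (start < next)
			return 0;		/* out of order or overlapping */
		if (end - 1 > last)
			return 0;		/* outside the parent's range */
		next = end;
	}
	return 1;
}
```

None of these checks has an equivalent for a plain indirect block, which is
just an array of 32-bit numbers with nothing to cross-check against.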

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-09 19:57                   ` Theodore Tso
  2006-06-09 20:09                     ` Jeff Garzik
  2006-06-09 20:38                     ` Joel Becker
@ 2006-06-12  8:58                     ` Jes Sorensen
  2 siblings, 0 replies; 296+ messages in thread
From: Jes Sorensen @ 2006-06-12  8:58 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
	Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger

>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:

Ted> On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
>> 1) clone a new tree 2) cp -a fs/ext3 fs/ext4 3) apply extent and
>> 48bit patches 4) apply related e2fsprogs patches
>> 
>> Then update ext4 step-by-step, using the normal Linux development
>> process.

Ted> We don't do this with the SCSI layer where we make a complete
Ted> clone of the driver layer so that there is a
Ted> /usr/src/linux/driver/scsi and /usr/src/linux/driver/scsi2, do
Ted> we?  And we didn't do that with the networking layer either, as
Ted> we added ipsec, ipv6, softnet, and a whole host of other changes
Ted> and improvements.

Maybe it's just me, but I am reading oranges vs apples there. The SCSI
comparison is like suggesting we go from the VFS to a VFS2 or fs/ ->
fs2/ for this. On the other hand going from ext3 -> ext4 to get
something incompatible (like enabling extents or 48 bit) is similar to
going net/ipv4 -> net/ipv6, which we did do indeed.

Fact of the matter is that 2.4 is dead or at least frozen solid by
now. The userland of most distros today wouldn't be able to boot with
a 2.4 kernel anyway.

Granted I am not a filesystem expert, but personally I would feel more
comfortable deciding to put my data on an ext4 file system knowing
that it was just that, rather than a hybrid.

Jes


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 18:51                       ` Jeff Garzik
                                           ` (2 preceding siblings ...)
  2006-06-11 16:02                         ` Arjan van de Ven
@ 2006-06-12 22:06                         ` Pavel Machek
  2006-06-14 14:31                           ` Barry K. Nathan
  3 siblings, 1 reply; 296+ messages in thread
From: Pavel Machek @ 2006-06-12 22:06 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel

Hi!

> >If ext2 and ext3 didn't support > 2GB files (which was 
> >a filesystem
> >feature added in exactly the same way as extents are 
> >today, and nobody
> >bitched about it then) then they would be relegated to 
> >the same status
> >as minix and xiafs and all the other filesystems that 
> >are stuck in the
> >"we can't change" or "we aren't supported" camps.
> 
> PRECISELY.  So you should stop modifying a filesystem 
> whose design is admittedly _not_ modern!
> 
> ext3 is already essentially xiafs-on-life-support, when 
> you consider today's large storage systems and today's 
> filesystem technology. 

Please don't. AFAIK, ext2/3 is only filesystem with working fsck
(because that fsck was actually needed in the old days). Starting from
xfs/jfs/reiser/??? means we no longer have working fsck...

-- 
Thanks for all the (sleeping) penguins.


* Re: Stable/devel policy - was Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-11  4:39                                       ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
  2006-06-11  5:19                                         ` Stable/devel policy - was " Linus Torvalds
@ 2006-06-13  0:28                                         ` Mingming Cao
  1 sibling, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-13  0:28 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jeff Garzik, Theodore Tso, Linus Torvalds, Kyle Moffett,
	Chase Venters, Alex Tomas, Andreas Dilger, Andrew Morton,
	ext2-devel, linux-kernel, linux-fsdevel

On Sun, 2006-06-11 at 14:39 +1000, Neil Brown wrote:
> On Saturday June 10, jeff@garzik.org wrote:
> > Theodore Tso wrote:
> > > 	So you would be OK with a model where we copy fs/ext3 to
> > > "fs/ext4", and do development there which would be merged rapidly into
> > > mainline so that people who want to participate in testing can use
> > > ext3dev, while people who want stability can use ext3 --- and at some
> > > point, we remove the old ext3 entirely and let fs/ext4 register itself
> > > as both the ext3 and ext4 filesystem, and at some point in the future,
> > > remove the ext3 name entirely?
> > 
> > Yep, and in addition I would argue that you can take the opportunity to 
> > make ext4 default to extents-enabled, and some similar behavior changes 
> > (dir_index default?).  The existence of both ext3 and ext4 means you can 
> > be more aggressive in turning on stuff, IMO.
> > 
> > 	Jeff
> 
> I'm wondering what all this has to say about general principles of
> sub-project development with the Linux kernel.
> 
> There is a strong tradition of software projects having a 'stable'
> branch and a 'development' branch, and having both available and both
> receiving bug fixes (at least) so that users can choose what best
> suits their needs.
> 
> Due to the (quite appropriate) lack of a stable API for kernel
> modules, it isn't really practical (and definitely isn't encouraged)
> to distribute kernel-modules separately.  This seems to suggest that
> if we want a 'stable' and a 'devel' branch of a project, both branches
> need to be distributed as part of the same kernel tree.
> 
> Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
> evidence of this happening.  Why is that?
> 
>  - is -mm enough?  It seems to be enough for small updates, but
>    doesn't seem to be enough for more major projects.  How long
>    have the ext3 patches been in -mm?? (I cannot actually seem
>    to find them there at all)
> 

To clarify, the first 4 patches of the series are bug fixes for both 32
bit ext3 (with the current on-disk layout) and 48 bit ext3 (extents based);
they are in the -mm tree now. The rest of the patches (5-13), which add
support for 48 bit ext3 based on extents, are not in the -mm tree.




* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 17:58             ` Gerrit Huizenga
  2006-06-09 18:25               ` [Ext2-devel] " Chase Venters
  2006-06-10 13:46               ` Adrian Bunk
@ 2006-06-13 13:34               ` Helge Hafting
  2 siblings, 0 replies; 296+ messages in thread
From: Helge Hafting @ 2006-06-13 13:34 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
	ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger

Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
>   
>> On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
>>     
>>> Jeff's approach taken to the ridiculous would mean that we'd have
>>> ext versions 1-40 by now at least.  I don't think that helps much,
>>> either.
>>>       
>> On the other hand, I _guarantee_ you that it helps that we have ext2-3, 
>> and not just ext2 (nobody even tried to keep ext1 compatible, thank the 
>> Gods).
>>     
>  
> I had originally argued for ext4 as well based on the fact that it would
> allow lots of potential cleanups & simplifications and at the same time
> would allow a break in the on disk filesystems layout.
>
> These changes don't yet change the actual on-disk layout and that might
> be something that would be done if ext4 were a real, new filesystem.
>
> But then how long until ext4 is used enough to be put into production?
>   
No problem.  It didn't take long for ext3 - it won't take long for ext4.
First, you have developers and some enthusiasts using it.
Then, you get the thousands of people who like living
on the edge using ext4. As soon as it doesn't have bad known bugs.
Then some distros pick it up, wanting to be first with
large-disk support.
After that, it is considered "harmless".

If a break in on-disk layout is useful, then the time is now while
a new fs is introduced anyway.  It could be 7 years to the next chance.

Helge Hafting




* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-12 22:06                         ` [Ext2-devel] " Pavel Machek
@ 2006-06-14 14:31                           ` Barry K. Nathan
  2006-06-14 21:34                             ` [Ext2-devel] " Pavel Machek
  0 siblings, 1 reply; 296+ messages in thread
From: Barry K. Nathan @ 2006-06-14 14:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

On 6/12/06, Pavel Machek <pavel@ucw.cz> wrote:
> Please don't. AFAIK, ext2/3 is only filesystem with working fsck
> (because that fsck was actually needed in the old days). Starting from
> xfs/jfs/reiser/??? means we no longer have working fsck...

Er, what do you mean by "working fsck"?

Unless I'm misunderstanding something, JFS also has a working fsck
(which has actually performed successful repair of real-world
filesystem corruption for me, although I haven't used it as much as
e2fsck or xfs_repair).

XFS's fsck is a no-op, but I think it could be implemented as a
wrapper around xfs_repair (and maybe xfs_check). xfs_repair has
successfully fixed corrupted filesystems for me, just as JFS's fsck
has.

(As for ReiserFS... well, in the past it's probably been too easy to
shoot yourself in the foot with reiserfsck and make the filesystem
worse-to-nonexistent instead of better. I haven't needed to use
reiserfsck on a corrupt FS lately so I don't know how it compares
these days.)
-- 
-Barry K. Nathan <barryn@pobox.com>


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-11 20:14                                   ` [Ext2-devel] " grundig
@ 2006-06-14 16:45                                     ` Alex Tomas
  0 siblings, 0 replies; 296+ messages in thread
From: Alex Tomas @ 2006-06-14 16:45 UTC (permalink / raw)
  To: grundig
  Cc: akpm, jeff, ext2-devel, linux-kernel, chase.venters, cmm,
	linux-fsdevel, Alex Tomas, adilger, alan

>>>>> grundig  (g) writes:

 g> Distros may ignore your opinion and enable it, and users won't know
 g> that it's enabled, or even that such a feature exists, until they try to run
 g> an older kernel. If almost nobody needs this feature, why not avoid
 g> problems by not merging it and maintaining it separately from the
 g> main tree?

not sure; in such a distro, such a user will be aware he's using ext4.
about "nobody needs ...": see my question regarding NUMA in kernel.

thanks, Alex


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-14 16:45                 ` [Ext2-devel] " Bodo Eggert
@ 2006-06-14 17:28                   ` Andreas Dilger
  0 siblings, 0 replies; 296+ messages in thread
From: Andreas Dilger @ 2006-06-14 17:28 UTC (permalink / raw)
  To: 7eggert
  Cc: akpm, jeff, ext2-devel, linux-kernel, chase.venters, torvalds,
	cmm, linux-fsdevel, grundig, alex, alan

On Jun 14, 2006  18:45 +0200, Bodo Eggert wrote:
> BTW: Upgrading a filesystem by using mount options _and_ forcing that
> option to be supplied on subsequent mounts is a BUG. If that is what
> the current code demands, it should be fixed ASAP. I hope that's not what
> the current code does.

If you don't remount with "-o extents" all it (currently) means is that
new files will not be created with extents.  Existing extent-mapped files
will continue to work.  It was done this way so that if some serious
problem was found with extents there was a fallback position to "normal"
block mapped files and the damage would be limited to files created while
mounted with "-o extents".
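
As a loose illustration of that behaviour (the mount option decides the
format only at file creation, and afterwards the file itself records how
it is mapped), here is a small sketch.  The types and sketch_* names are
hypothetical and are not taken from the patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_mapping { SKETCH_BLOCK_MAPPED, SKETCH_EXTENT_MAPPED };

struct sketch_inode {
	enum sketch_mapping mapping;	/* recorded per file at creation */
};

/* New file: the format follows the current mount option. */
static void sketch_init_inode(struct sketch_inode *ino, bool mount_extents)
{
	assert(ino != NULL);
	ino->mapping = mount_extents ? SKETCH_EXTENT_MAPPED
				     : SKETCH_BLOCK_MAPPED;
}

/* Existing file: the format follows what the inode recorded; the
 * current mount option is deliberately ignored. */
static enum sketch_mapping
sketch_mapping_for_io(const struct sketch_inode *ino, bool mount_extents)
{
	assert(ino != NULL);
	(void)mount_extents;
	return ino->mapping;
}
```

Under this model, dropping "-o extents" on a later mount only changes what
new files get; files created earlier keep whichever mapping they were
created with.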

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-14 14:31                           ` Barry K. Nathan
@ 2006-06-14 21:34                             ` Pavel Machek
  2006-06-15  0:28                               ` Barry K. Nathan
  0 siblings, 1 reply; 296+ messages in thread
From: Pavel Machek @ 2006-06-14 21:34 UTC (permalink / raw)
  To: Barry K. Nathan
  Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

Hi!

> >Please don't. AFAIK, ext2/3 is only filesystem with 
> >working fsck
> >(because that fsck was actually needed in the old 
> >days). Starting from
> >xfs/jfs/reiser/??? means we no longer have working 
> >fsck...
> 
> Er, what do you mean by "working fsck"?

Passes 8 hours of me trying to intentionally break it with weird,
artificial disk corruption.

I even have script somewhere.

> Unless I'm misunderstanding something, JFS also has a 
> working fsck
> (which has actually performed successful repair of 
> real-world
> filesystem corruption for me, although I haven't used it 
> as much as
> e2fsck or xfs_repair).

...like, if it repaired 100 different, non-trivial corruptions, that
would be argument.

fsck.ext2 survives my torture (in some versions). fsck.vfat never
worked for me (likes to segfault), fsck.reiser never worked for me.

							Pavel
-- 
Thanks for all the (sleeping) penguins.


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-14 21:34                             ` [Ext2-devel] " Pavel Machek
@ 2006-06-15  0:28                               ` Barry K. Nathan
  2006-06-15  4:55                                 ` Theodore Tso
  2006-06-15  9:15                                 ` Pavel Machek
  0 siblings, 2 replies; 296+ messages in thread
From: Barry K. Nathan @ 2006-06-15  0:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

On 6/14/06, Pavel Machek <pavel@ucw.cz> wrote:
> Passes 8 hours of me trying to intentionally break it with weird,
> artificial disk corruption.
>
> I even have script somewhere.

Ok, thanks for clarifying.

> > Unless I'm misunderstanding something, JFS also has a
> > working fsck
> > (which has actually performed successful repair of
> > real-world
> > filesystem corruption for me, although I haven't used it
> > as much as
> > e2fsck or xfs_repair).
>
> ...like, if it repaired 100 different, non-trivial corruptions, that
> would be argument.

In the case of XFS, I've repaired maybe two dozen (or so) corruptions
that might be non-trivial (in most of the cases, the filesystem
wouldn't even mount before the repair).

> fsck.ext2 survives my torture (in some versions). fsck.vfat never
> worked for me (likes to segfault), fsck.reiser never worked for me.

BTW, I actually have a test filesystem here (an e2image from an actual
filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
segfault. Strangely, more ancient versions (like what ships in Red Hat
7.2) were able to repair it without segfaulting. In a few days, once
other stuff calms down for me, I need to revisit that and see if the
bug still exists with 1.39.
-- 
-Barry K. Nathan <barryn@pobox.com>


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-15  0:28                               ` Barry K. Nathan
@ 2006-06-15  4:55                                 ` Theodore Tso
  2006-06-15  7:43                                   ` Barry K. Nathan
  2006-06-15  9:15                                 ` Pavel Machek
  1 sibling, 1 reply; 296+ messages in thread
From: Theodore Tso @ 2006-06-15  4:55 UTC (permalink / raw)
  To: Barry K. Nathan; +Cc: ext2-devel, linux-kernel, linux-fsdevel

On Wed, Jun 14, 2006 at 05:28:31PM -0700, Barry K. Nathan wrote:
> BTW, I actually have a test filesystem here (an e2image from an actual
> filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
> segfault. Strangely, more ancient versions (like what ships in Red Hat
> 7.2) were able to repair it without segfaulting. In a few days, once
> other stuff calms down for me, I need to revisit that and see if the
> bug still exists with 1.39.

Please try it with 1.39; if it still crashes, let me know --- I treat
any filesystem corruption that causes e2fsck to crash or which e2fsck
can't fix in a single pass to be a bug.  I'm guessing though that this
was probably this bug, which was fixed right after 1.38 was released (some
distributions did have the fix, but it's in the mainline e2fsprogs
starting with 1.39):

2005-07-04  Theodore Ts'o  <tytso@mit.edu>

	* pass2.c (e2fsck_process_bad_inode): Fixed bug which could cause
		e2fsck to core dump if a disconnected inode contained an
		extended attribute.  This was actually caused by two bugs.
		The first bug is that if the inode has been fully fixed
		up, the code will attempt to remove the inode from the
		inode_bad_map without checking to see if this bitmap is
		present.  Since it is cleared at the end of pass 2, if
		e2fsck_process_bad_inode is called in pass 4 (as it is for
		disconnected inodes), this would result in a core dump.
		This bug was mostly hidden by a second bug, which caused
		e2fsck_process_bad_inode() to consider all inodes without
		an extended attribute to be not fixed.  (Addresses Debian
		Bug: #316736)

						- Ted


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-15  4:55                                 ` Theodore Tso
@ 2006-06-15  7:43                                   ` Barry K. Nathan
  0 siblings, 0 replies; 296+ messages in thread
From: Barry K. Nathan @ 2006-06-15  7:43 UTC (permalink / raw)
  To: Theodore Tso, Barry K. Nathan, ext2-devel, linux-kernel,
	linux-fsdevel

On 6/14/06, Theodore Tso <tytso@mit.edu> wrote:
> Please try it with 1.39; if it still crashes, let me know --- I treat
[snip]

1.39 fixes it. Cool!

However, http://e2fsprogs.sourceforge.net/ is still touting the "NEW"
e2fsprogs 1.38 release. I think it would be a good idea to update the
page...
-- 
-Barry K. Nathan <barryn@pobox.com>


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-15  0:28                               ` Barry K. Nathan
  2006-06-15  4:55                                 ` Theodore Tso
@ 2006-06-15  9:15                                 ` Pavel Machek
  2006-06-15  9:40                                   ` Barry K. Nathan
  1 sibling, 1 reply; 296+ messages in thread
From: Pavel Machek @ 2006-06-15  9:15 UTC (permalink / raw)
  To: Barry K. Nathan
  Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

Hi!

> >Passes 8 hours of me trying to intentionally break it with weird,
> >artificial disk corruption.
> >
> >I even have script somewhere.
> 
> Ok, thanks for clarifying.

You can get a copy, it would be interesting to know how JFS/XFS does.

> >> Unless I'm misunderstanding something, JFS also has a
> >> working fsck
> >> (which has actually performed successful repair of
> >> real-world
> >> filesystem corruption for me, although I haven't used it
> >> as much as
> >> e2fsck or xfs_repair).
> >
> >...like, if it repaired 100 different, non-trivial corruptions, that
> >would be argument.
> 
> In the case of XFS, I've repaired maybe two dozen (or so) corruptions
> that might be non-trivial (in most of the cases, the filesystem
> wouldn't even mount before the repair).
> 
> >fsck.ext2 survives my torture (in some versions). fsck.vfat never
> >worked for me (likes to segfault), fsck.reiser never worked for me.
> 
> BTW, I actually have a test filesystem here (an e2image from an actual
> filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
> segfault. Strangely, more ancient versions (like what ships in Red Hat
> 7.2) were able to repair it without segfaulting. In a few days, once
> other stuff calms down for me, I need to revisit that and see if the
> bug still exists with 1.39.

It varies a bit between versions, but at least e2fsck has a regression
test suite... I had a nasty ext2 corruption in the past (suspend wrote 0 onto
a strategic place in the bitmaps) that put the filesystem in self-destruct
mode. e2fsck reported fixing the corruption, but did not really fix
it... e2fsck has been fixed in the meantime.

(I also have a way to corrupt ext2 that basically can't be
repaired automatically. Deallocating the free block bitmap and putting
data in the freed space is an evil way to corrupt a filesystem).
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-15  9:15                                 ` Pavel Machek
@ 2006-06-15  9:40                                   ` Barry K. Nathan
  2006-06-15  9:50                                     ` [Ext2-devel] " Pavel Machek
  0 siblings, 1 reply; 296+ messages in thread
From: Barry K. Nathan @ 2006-06-15  9:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas

On 6/15/06, Pavel Machek <pavel@suse.cz> wrote:
> Hi!
>
> > > >Passes 8 hours of me trying to intentionally break it with weird,
> > > >artificial disk corruption.
> > >
> > >I even have script somewhere.
> >
> > Ok, thanks for clarifying.
>
> You can get a copy, it would be interesting to know how JFS/XFS does.

Ok, I would be interested in getting a copy. (Maybe it would be good
to post it in public so that other people can try it too.)
-- 
-Barry K. Nathan <barryn@pobox.com>


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-15  9:40                                   ` Barry K. Nathan
@ 2006-06-15  9:50                                     ` Pavel Machek
  0 siblings, 0 replies; 296+ messages in thread
From: Pavel Machek @ 2006-06-15  9:50 UTC (permalink / raw)
  To: Barry K. Nathan
  Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
	Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel

Hi!

> >> >Passes 8 hours of me trying to intentionally break it with weird,
> >> >artificial disk corruption.
> >> >
> >> >I even have script somewhere.
> >>
> >> Ok, thanks for clarifying.
> >
> >You can get a copy, it would be interesting to know how JFS/XFS does.
> 
> Ok, I would be interested in getting a copy. (Maybe it would be good
> to post it in public so that other people can try it too.)

It needs some hand-tuning to do maximum damage to the filesystem while
still keeping the filesystem "recognizable". It also depends on fsck returning
reasonable error codes...
								Pavel

#!/bin/bash
#
# fscktest
#
# Usage: 
#	 Make sure output is logged somewhere
#        First, run fscktest -p as root
#	 Then you can run fscktest as normal user...
#

prepare() {
	SIZE=100000	# image size in KiB
	echo "Creating file..."
	head -c $((SIZE*1024)) /dev/zero > test
	echo "Making filesystem..."
	mkfs.$FS test
	echo "Mounting..."
	# exit takes a numeric status, so print the message separately
	mount test -o loop /mnt || { echo "Can't mount"; exit 1; }
	echo "Copying files..."
	cp -a /bin /mnt
	cp -a /usr/bin /mnt
	cp -a /usr/src/linux /mnt
	echo "Syncing..."
	sync
	echo "Unmounting..."
	umount /mnt
	echo "Moving..."
	mv test fsck.okay
	echo "All done."
}

FS=ext2
if [ ".$1" = ".-p" ]; then
	prepare
	exit
fi
RUN=0
while true; do
	RUN=$((RUN+1))
	echo "Run #$RUN"
	echo Preparing...
	# Start every run from the pristine image
	cat fsck.okay > fsck.damaged
	echo Damaging...
	# Overwrite 10240 sectors (dd's default 512-byte blocks) with random
	# data, starting at sector 7; this is the part that needs hand-tuning
	dd if=/dev/urandom of=fsck.damaged count=10240 seek=7 conv=notrunc
	# Keep a copy of the damaged image so a failure can be reproduced
	cp fsck.damaged fsck.test
	echo First check...
	fsck.$FS -fy fsck.damaged
	RESULT=$?
	# 0 = no errors, 1 = errors corrected, 2 = corrected, reboot advised
	if [ $RESULT != 1 -a $RESULT != 2 -a $RESULT != 0 ]; then
		echo "Fsck failed in bad way (result = $RESULT)"
		exit 1
	fi
	echo Second check...
	# A repaired filesystem must come back clean on the second pass
	fsck.$FS -fy fsck.damaged
	RESULT=$?
	if [ $RESULT != 0 ]; then
		echo "Fsck lied about its success (result = $RESULT)"
		exit 1
	fi
done


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:14                       ` Alex Tomas
  2006-06-09 20:28                         ` Jeff Garzik
@ 2006-06-19  7:48                         ` Helge Hafting
  1 sibling, 0 replies; 296+ messages in thread
From: Helge Hafting @ 2006-06-19  7:48 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Jeff Garzik, Theodore Tso, Andrew Morton, ext2-devel,
	linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger

Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>>>>>>             
>
>  JG> No, there is a key difference between ext3 and SCSI/etc.:  cruft is removed.
>
>  JG> In ext3, old formats are supported for all eternity.
>
> we'd need this anyway. just to let users to migrate.
>   
Not really.  Today, people use reiserfs even though they couldn't
just remount their old ext2 as reiserfs.

An ext2/ext3-incompatible ext4 isn't a problem.  Sure, people will
have to mkfs instead of just remounting, and that will mean fewer
quick conversions in the short-term.  But people using ext3 today
don't really need ext4 - they are by definition running on sufficiently
small disks/partitions.

So an incompatible ext4 will still see use - on new filesystems mostly.
Not a problem, people buy disks all the time.

Helge Hafting


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-09 20:46         ` Linus Torvalds
  2006-06-09 20:56           ` Alex Tomas
@ 2006-06-20  6:15           ` Qi Yong
  2006-06-20  8:26             ` Laurent Vivier
  1 sibling, 1 reply; 296+ messages in thread
From: Qi Yong @ 2006-06-20  6:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Jeff Garzik, Andreas Dilger, Andrew Morton,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, alex

Linus Torvalds wrote:

>On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>  
>
>>When is the Linux syscall interface enough?  When should we just bump it
>>and cut out all the compatibility interfaces?
>>
>>No, we don't; we let people configure certain obsolete bits out (a.out
>>support etc), but we keep it in the tree despite the indirection cost to
>>maintain multiple interfaces etc.
>>    
>>
>
>Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT 
>MIGHT BREAK OLD USERS.
>
>Your point was exactly what?
>
>Btw, where did that 2TB limit number come from? Afaik, it should be 16TB 
>for a 4kB filesystem, no?
>  
>

Partition tables describe partitions in units of one sector.
2^(32+9) = 2T

To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
2^(31+12) = 8T

There's _terrible_ hacks to really get to 16T.
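
The limits above all come from the same formula, (number of index bits) +
(log2 of the unit size).  A minimal sketch, with a made-up function name,
that reproduces them and the 48-bit figure from the patch description:

```c
#include <assert.h>
#include <stdint.h>

/* Largest addressable size in bytes when index_bits bits count units
 * of (1 << unit_shift) bytes: 9 for 512-byte sectors, 12 for 4 KiB
 * filesystem blocks. */
static uint64_t sketch_max_bytes(unsigned index_bits, unsigned unit_shift)
{
	assert(index_bits + unit_shift < 64);
	return (uint64_t)1 << (index_bits + unit_shift);
}
```

sketch_max_bytes(32, 9) is 2T (32-bit sector counts in a partition table),
sketch_max_bytes(31, 12) is 8T (signed 32-bit block numbers, 4k blocks),
sketch_max_bytes(32, 12) is 16T, and sketch_max_bytes(48, 12) is 2^60
bytes, the 1024 PB quoted for 48-bit ext3.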

-- qiyong


* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-20  6:15           ` [Ext2-devel] " Qi Yong
@ 2006-06-20  8:26             ` Laurent Vivier
  2006-06-20  8:30               ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Laurent Vivier @ 2006-06-20  8:26 UTC (permalink / raw)
  To: Qi Yong
  Cc: Linus Torvalds, Andrew Morton, Jeff Garzik, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, alex, Andreas Dilger

[-- Attachment #1: Type: text/plain, Size: 1290 bytes --]

Qi Yong wrote:
> Linus Torvalds wrote:
> 
>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>  
>>
>>> When is the Linux syscall interface enough?  When should we just bump it
>>> and cut out all the compatibility interfaces?
>>>
>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>> support etc), but we keep it in the tree despite the indirection cost to
>>> maintain multiple interfaces etc.
>>>    
>>>
>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT 
>> MIGHT BREAK OLD USERS.
>>
>> Your point was exactly what?
>>
>> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB 
>> for a 4kB filesystem, no?
>>  
>>
> 
> Partition tables describe partitions in units of one sector.
> 2^(32+9) = 2T
> 
> To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
> 2^(31+12) = 8T
> 
> There's _terrible_ hacks to really get to 16T.
> 
> -- qiyong
> 

IMHO, a simple solution is to use a "Logical Volume Manager" instead of a partition
manager: we create a 64bit filesystem in a Logical Volume, not in a partition.

"partitioning is obsolete" ;-)

Regards,
Laurent

-- 
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-20  8:26             ` Laurent Vivier
@ 2006-06-20  8:30               ` Jeff Garzik
  2006-06-20  9:21                 ` Laurent Vivier
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-20  8:30 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: Qi Yong, Linus Torvalds, Andrew Morton, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, alex, Andreas Dilger

Laurent Vivier wrote:
> Qi Yong wrote:
>> Linus Torvalds wrote:
>>
>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>  
>>>
>>>> When is the Linux syscall interface enough?  When should we just bump it
>>>> and cut out all the compatibility interfaces?
>>>>
>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>> support etc), but we keep it in the tree despite the indirection cost to
>>>> maintain multiple interfaces etc.
>>>>    
>>>>
>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT 
>>> MIGHT BREAK OLD USERS.
>>>
>>> Your point was exactly what?
>>>
>>> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB 
>>> for a 4kB filesystem, no?
>>>  
>>>
>> Partition tables describe partitions in units of one sector.
>> 2^(32+9) = 2T
>>
>> To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
>> 2^(31+12) = 8T
>>
>> There's _terrible_ hacks to really get to 16T.
>>
>> -- qiyong
>>
> 
> IMHO, a simple solution is to use "Logical Volume Manager" instead of partition
> manager: we create 64bit filesystem in a Logical Volume, not in a partition.

That doesn't solve anything, if you are not using a 64bit filesystem.


> "partitioning is obsolete" ;-)

LVM is nothing but a partition manager...

	Jeff




^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
  2006-06-20  8:30               ` Jeff Garzik
@ 2006-06-20  9:21                 ` Laurent Vivier
  2006-06-20  9:48                   ` Jeff Garzik
  0 siblings, 1 reply; 296+ messages in thread
From: Laurent Vivier @ 2006-06-20  9:21 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Qi Yong, Linus Torvalds, Andrew Morton, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
	linux-fsdevel, alex, Andreas Dilger

[-- Attachment #1: Type: text/plain, Size: 2386 bytes --]

Jeff Garzik wrote:
> Laurent Vivier wrote:
>> Qi Yong wrote:
>>> Linus Torvalds wrote:
>>>
>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>  
>>>>
>>>>> When is the Linux syscall interface enough?  When should we just
>>>>> bump it
>>>>> and cut out all the compatibility interfaces?
>>>>>
>>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>>> support etc), but we keep it in the tree despite the indirection
>>>>> cost to
>>>>> maintain multiple interfaces etc.
>>>>>   
>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>
>>>> Your point was exactly what?
>>>>
>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>> 16TB for a 4kB filesystem, no?
>>>>  
>>>>
>>> Partition tables describe partitions in units of one sector.
>>> 2^(32+9) = 2T
>>>
>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>> integer.
>>> 2^(31+12) = 8T
>>>
>>> There's _terrible_ hacks to really get to 16T.
>>>
>>> -- qiyong
>>>
>>
>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>> partition
>> manager: we create 64bit filesystem in a Logical Volume, not in a
>> partition.
> 
> That doesn't solve anything, if you are not using a 64bit filesystem.

Sorry, I don't understand why.

You can use a 32bit filesystem too, but then you limit the size of the logical
volume to what the filesystem can handle. LVM allows creating several 32bit
volumes on a big (> 8T) disk (if one exists).

But if we think further: as the biggest disk is 750 GB (and I think that even
with HW RAID there is a HW limit of something like 4 TB), we can imagine a big
Volume Group made of several Physical Volumes and divided into Logical Volumes.
So we already use LVM, and we don't need partitions...

>> "partitioning is obsolete" ;-)
> 
> LVM is nothing but a partition manager...

LVM is more than a partition manager:
- it is arch-independent
- it is 64bit compliant
- it can gather together several disks
- it is flexible (you can add/remove/resize volumes)
- it is modern (no primary/extended partitions, no limit on the number of
partitions)

so... it's a volume manager.

Regards,
Laurent
-- 
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-20  9:21                 ` Laurent Vivier
@ 2006-06-20  9:48                   ` Jeff Garzik
  2006-06-20 10:40                     ` Laurent Vivier
  0 siblings, 1 reply; 296+ messages in thread
From: Jeff Garzik @ 2006-06-20  9:48 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: Andrew Morton, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, alex, Andreas Dilger, Qi Yong

Laurent Vivier wrote:
> Jeff Garzik wrote:
>> Laurent Vivier wrote:
>>> Qi Yong wrote:
>>>> Linus Torvalds wrote:
>>>>
>>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>>  
>>>>>
>>>>>> When is the Linux syscall interface enough?  When should we just
>>>>>> bump it
>>>>>> and cut out all the compatibility interfaces?
>>>>>>
>>>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>>>> support etc), but we keep it in the tree despite the indirection
>>>>>> cost to
>>>>>> maintain multiple interfaces etc.
>>>>>>   
>>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>>
>>>>> Your point was exactly what?
>>>>>
>>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>>> 16TB for a 4kB filesystem, no?
>>>>>  
>>>>>
>>>> Partition tables describe partitions in units of one sector.
>>>> 2^(32+9) = 2T
>>>>
>>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>>> integer.
>>>> 2^(31+12) = 8T
>>>>
>>>> There's _terrible_ hacks to really get to 16T.
>>>>
>>>> -- qiyong
>>>>
>>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>>> partition
>>> manager: we create 64bit filesystem in a Logical Volume, not in a
>>> partition.
>> That doesn't solve anything, if you are not using a 64bit filesystem.
> 
> Sorry, I don't understand why.
> 
> You can use a 32bit filesystem too, but then you limit the size of the logical
> volume to what the filesystem can handle. LVM allows creating several 32bit
> volumes on a big (> 8T) disk (if one exists).

Let's review the thread:

qiyong: <these limits> exist in the filesystem
you: bust those limits with LVM!

I think you are misunderstanding the subthread.


>>> "partitioning is obsolete" ;-)
>> LVM is nothing but a partition manager...
> 
> LVM is more than a partition manager:

I am well aware of what LVM2 and device mapper can do.

	Jeff

^ permalink raw reply	[flat|nested] 296+ messages in thread

* Re: [RFC 0/13] extents and 48bit ext3
  2006-06-20  9:48                   ` Jeff Garzik
@ 2006-06-20 10:40                     ` Laurent Vivier
  0 siblings, 0 replies; 296+ messages in thread
From: Laurent Vivier @ 2006-06-20 10:40 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Stephen C. Tweedie,
	ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
	Mingming Cao, linux-fsdevel, alex, Andreas Dilger, Qi Yong


[-- Attachment #1.1: Type: text/plain, Size: 2246 bytes --]

Jeff Garzik wrote:
> Laurent Vivier wrote:
>> Jeff Garzik wrote:
>>> Laurent Vivier wrote:
>>>> Qi Yong wrote:
>>>>> Linus Torvalds wrote:
>>>>>
>>>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>>>  
>>>>>>
>>>>>>> When is the Linux syscall interface enough?  When should we just
>>>>>>> bump it
>>>>>>> and cut out all the compatibility interfaces?
>>>>>>>
>>>>>>> No, we don't; we let people configure certain obsolete bits out
>>>>>>> (a.out
>>>>>>> support etc), but we keep it in the tree despite the indirection
>>>>>>> cost to
>>>>>>> maintain multiple interfaces etc.
>>>>>>>   
>>>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>>>
>>>>>> Your point was exactly what?
>>>>>>
>>>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>>>> 16TB for a 4kB filesystem, no?
>>>>>>  
>>>>>>
>>>>> Partition tables describe partitions in units of one sector.
>>>>> 2^(32+9) = 2T
>>>>>
>>>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>>>> integer.
>>>>> 2^(31+12) = 8T
>>>>>
>>>>> There's _terrible_ hacks to really get to 16T.
>>>>>
>>>>> -- qiyong
>>>>>
>>>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>>>> partition
>>>> manager: we create 64bit filesystem in a Logical Volume, not in a
>>>> partition.
>>> That doesn't solve anything, if you are not using a 64bit filesystem.
>>
>> Sorry, I don't understand why.
>>
>> You can use a 32bit filesystem too, but then you limit the size of the
>> logical volume to what the filesystem can handle. LVM allows creating
>> several 32bit volumes on a big (> 8T) disk (if one exists).
> 
> Let's review the thread:
> 
> qiyong: <these limits> exist in the filesystem
> you: bust those limits with LVM!
> 
> I think you are misunderstanding the subthread.
> 

Yes...

I understood:
qiyong: <these limits> exist in the partition manager.

(because with patches proposed on http://www.bullopensource.org/ext4 these
limits don't exist anymore in the filesystem.)

Regards,
Laurent
-- 
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update 0/16]extents and 48bit ext3/4 patches
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (2 preceding siblings ...)
  2006-06-09  9:13 ` Christoph Hellwig
@ 2006-06-30  0:16 ` Mingming Cao
  2006-06-30  0:16 ` [RFC][Update][Patch 1/16]core extent map support Mingming Cao
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Hello!

Here are the updated ext3/4 patches to support the 48bit ext4 filesystem.

http://ext2.sourceforge.net/48bitext3/patches/latest/

Changes since the last post include:

- bug fixes in the 64bit JBD changes, which broke non-extents 32 bit ext3
filesystems
- sync up the on-disk super block structure with e2fsprogs
- add initial handling of uninitialized extents
- removed the 32 bit ext3 bug fix patches from this series, as they have been
folded into the current Linus git tree.


The patches are against 2.6.17-git13 and were tested on both 32 bit and 64 bit
archs; they survived the fsx test on 32bit ext3 (mounted w/o extents, with
CONFIG_LBD enabled) and 48 bit ext4 (mounted with extents).

Any comments and feedback are appreciated.

Thanks,

Mingming

-------------------------------------
patch series:

#------------------------
# base extent support (32bit)
#------------------------
ext3-extents.patch
#------------------------
# 48bit ext3 patches
#-------------------------
# 64 bit in-kernel block number support
#
#sector_t type format string for all arch
sector_fmt.patch
#support >32 bit fs block type in kernel (convert ext3_fsblk_t to sector_t)
ext3_fsblk_sector_t.patch
#
#48 bit extent map patches
#
ext3-extents-48bit.patch
ext3-extents-ext3_fsblk_t.patch
ext3-unitialized-extent-handling.patch
#
# 64bit JBD support
#
64bit_jbd_core.patch
jbd-avoid-blk-overflow-write-journal-metadata-tag.patch
jbd-read-32bit-tag-fix.patch
jbd-cleanup-journal_tag_bytes.patch
sector_t-jbd.patch
jbd-revoke-32bit-shift-fix.patch
#
# 48 bit on-disk xattr support
#
ext3_48bit_i_file_acl.patch
#
# 64bit on-disk sb metadata changes
#
64bit-metadata.patch
64bit-incompat-flag-change.patch
ext3-sb-struc-sync-with-e2fsprog.patch




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 1/16]core extent map support
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (3 preceding siblings ...)
  2006-06-30  0:16 ` [RFC][Update 0/16]extents and 48bit ext3/4 patches Mingming Cao
@ 2006-06-30  0:16 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 2/16]sector_t type format string Mingming Cao
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Add extent map support to ext3. Patch from Alex Tomas.

On disk extents format:
/*
  * this is extent on-disk structure
  * it's used at the bottom of the tree
  */
struct ext3_extent {
        __le32  ee_block;       /* first logical block extent covers */
        __le16  ee_len;         /* number of blocks covered by extent */
        __le16  ee_start_hi;    /* high 16 bits of physical block */
        __le32  ee_start;       /* low 32 bits of physical block */
};

Signed-off-by: Alex Tomas <alex@clusterfs.com>

---

 linux-2.6.17-ming/fs/ext3/Makefile                |    2 
 linux-2.6.17-ming/fs/ext3/dir.c                   |    3 
 linux-2.6.17-ming/fs/ext3/extents.c               | 2069 ++++++++++++++++++++++
 linux-2.6.17-ming/fs/ext3/ialloc.c                |   11 
 linux-2.6.17-ming/fs/ext3/inode.c                 |   17 
 linux-2.6.17-ming/fs/ext3/ioctl.c                 |    1 
 linux-2.6.17-ming/fs/ext3/super.c                 |   10 
 linux-2.6.17-ming/include/linux/ext3_fs.h         |   31 
 linux-2.6.17-ming/include/linux/ext3_fs_extents.h |  196 ++
 linux-2.6.17-ming/include/linux/ext3_fs_i.h       |   13 
 linux-2.6.17-ming/include/linux/ext3_fs_sb.h      |   10 
 linux-2.6.17-ming/include/linux/ext3_jbd.h        |   17 
 12 files changed, 2361 insertions(+), 19 deletions(-)

diff -puN fs/ext3/dir.c~ext3-extents fs/ext3/dir.c
--- linux-2.6.17/fs/ext3/dir.c~ext3-extents	2006-06-28 13:25:19.629991124 -0700
+++ linux-2.6.17-ming/fs/ext3/dir.c	2006-06-28 13:25:19.674985962 -0700
@@ -131,8 +131,7 @@ static int ext3_readdir(struct file * fi
 		struct buffer_head *bh = NULL;
 
 		map_bh.b_state = 0;
-		err = ext3_get_blocks_handle(NULL, inode, blk, 1,
-						&map_bh, 0, 0);
+		err = ext3_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0);
 		if (err > 0) {
 			page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
 				&filp->f_ra,
diff -puN /dev/null fs/ext3/extents.c
--- /dev/null	2006-06-28 00:02:13.345547960 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c	2006-06-28 13:39:22.744250572 -0700
@@ -0,0 +1,2069 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
+ * Written by Alex Tomas <alex@clusterfs.com>
+ *
+ * Architecture independence:
+ *   Copyright (c) 2005, Bull S.A.
+ *   Written by Pierre Peiffer <pierre.peiffer@bull.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public Licens
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
+ */
+
+/*
+ * Extents support for EXT3
+ *
+ * TODO:
+ *   - ext3*_error() should be used in some situations
+ *   - analyze all BUG()/BUG_ON(), use -EIO where appropriate
+ *   - smart tree reduction
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/time.h>
+#include <linux/ext3_jbd.h>
+#include <linux/jbd.h>
+#include <linux/smp_lock.h>
+#include <linux/highuid.h>
+#include <linux/pagemap.h>
+#include <linux/quotaops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/ext3_fs_extents.h>
+#include <asm/uaccess.h>
+
+
+static int ext3_ext_check_header(const char *function, struct inode *inode,
+				struct ext3_extent_header *eh)
+{
+	const char *error_msg = NULL;
+
+	if (unlikely(eh->eh_magic != EXT3_EXT_MAGIC)) {
+		error_msg = "invalid magic";
+		goto corrupted;
+	}
+	if (unlikely(eh->eh_max == 0)) {
+		error_msg = "invalid eh_max";
+		goto corrupted;
+	}
+	if (unlikely(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max))) {
+		error_msg = "invalid eh_entries";
+		goto corrupted;
+	}
+	return 0;
+
+corrupted:
+	ext3_error(inode->i_sb, function,
+			"bad header in inode #%lu: %s - magic %x, "
+			"entries %u, max %u, depth %u",
+			inode->i_ino, error_msg, le16_to_cpu(eh->eh_magic),
+			le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
+			le16_to_cpu(eh->eh_depth));
+
+	return -EIO;
+}
+
+static handle_t *ext3_ext_journal_restart(handle_t *handle, int needed)
+{
+	int err;
+
+	if (handle->h_buffer_credits > needed)
+		return handle;
+	if (!ext3_journal_extend(handle, needed))
+		return handle;
+	err = ext3_journal_restart(handle, needed);
+
+	return handle;
+}
+
+/*
+ * could return:
+ *  - EROFS
+ *  - ENOMEM
+ */
+static int ext3_ext_get_access(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *path)
+{
+	if (path->p_bh) {
+		/* path points to block */
+		return ext3_journal_get_write_access(handle, path->p_bh);
+	}
+	/* path points to leaf/index in inode body */
+	/* we use in-core data, no need to protect them */
+	return 0;
+}
+
+/*
+ * could return:
+ *  - EROFS
+ *  - ENOMEM
+ *  - EIO
+ */
+static int ext3_ext_dirty(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *path)
+{
+	int err;
+	if (path->p_bh) {
+		/* path points to block */
+		err = ext3_journal_dirty_metadata(handle, path->p_bh);
+	} else {
+		/* path points to leaf/index in inode body */
+		err = ext3_mark_inode_dirty(handle, inode);
+	}
+	return err;
+}
+
+static int ext3_ext_find_goal(struct inode *inode,
+			      struct ext3_ext_path *path,
+			      unsigned long block)
+{
+	struct ext3_inode_info *ei = EXT3_I(inode);
+	unsigned long bg_start;
+	unsigned long colour;
+	int depth;
+
+	if (path) {
+		struct ext3_extent *ex;
+		depth = path->p_depth;
+
+		/* try to predict block placement */
+		if ((ex = path[depth].p_ext))
+			return le32_to_cpu(ex->ee_start)
+					+ (block - le32_to_cpu(ex->ee_block));
+
+		/* it looks like the index is empty;
+		 * try to find a goal starting from the index itself */
+		if (path[depth].p_bh)
+			return path[depth].p_bh->b_blocknr;
+	}
+
+	/* OK. use inode's group */
+	bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
+		le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block);
+	colour = (current->pid % 16) *
+			(EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16);
+	return bg_start + colour + block;
+}
+
+static int
+ext3_ext_new_block(handle_t *handle, struct inode *inode,
+			struct ext3_ext_path *path,
+			struct ext3_extent *ex, int *err)
+{
+	int goal, newblock;
+
+	goal = ext3_ext_find_goal(inode, path, le32_to_cpu(ex->ee_block));
+	newblock = ext3_new_block(handle, inode, goal, err);
+	return newblock;
+}
+
+static inline int ext3_ext_space_block(struct inode *inode)
+{
+	int size;
+
+	size = (inode->i_sb->s_blocksize - sizeof(struct ext3_extent_header))
+			/ sizeof(struct ext3_extent);
+#ifdef AGRESSIVE_TEST
+	if (size > 6)
+		size = 6;
+#endif
+	return size;
+}
+
+static inline int ext3_ext_space_block_idx(struct inode *inode)
+{
+	int size;
+
+	size = (inode->i_sb->s_blocksize - sizeof(struct ext3_extent_header))
+			/ sizeof(struct ext3_extent_idx);
+#ifdef AGRESSIVE_TEST
+	if (size > 5)
+		size = 5;
+#endif
+	return size;
+}
+
+static inline int ext3_ext_space_root(struct inode *inode)
+{
+	int size;
+
+	size = sizeof(EXT3_I(inode)->i_data);
+	size -= sizeof(struct ext3_extent_header);
+	size /= sizeof(struct ext3_extent);
+#ifdef AGRESSIVE_TEST
+	if (size > 3)
+		size = 3;
+#endif
+	return size;
+}
+
+static inline int ext3_ext_space_root_idx(struct inode *inode)
+{
+	int size;
+
+	size = sizeof(EXT3_I(inode)->i_data);
+	size -= sizeof(struct ext3_extent_header);
+	size /= sizeof(struct ext3_extent_idx);
+#ifdef AGRESSIVE_TEST
+	if (size > 4)
+		size = 4;
+#endif
+	return size;
+}
+
+#ifdef EXT_DEBUG
+static void ext3_ext_show_path(struct inode *inode, struct ext3_ext_path *path)
+{
+	int k, l = path->p_depth;
+
+	ext_debug("path:");
+	for (k = 0; k <= l; k++, path++) {
+		if (path->p_idx) {
+		  ext_debug("  %d->%d", le32_to_cpu(path->p_idx->ei_block),
+			    le32_to_cpu(path->p_idx->ei_leaf));
+		} else if (path->p_ext) {
+			ext_debug("  %d:%d:%d",
+				  le32_to_cpu(path->p_ext->ee_block),
+				  le16_to_cpu(path->p_ext->ee_len),
+				  le32_to_cpu(path->p_ext->ee_start));
+		} else
+			ext_debug("  []");
+	}
+	ext_debug("\n");
+}
+
+static void ext3_ext_show_leaf(struct inode *inode, struct ext3_ext_path *path)
+{
+	int depth = ext_depth(inode);
+	struct ext3_extent_header *eh;
+	struct ext3_extent *ex;
+	int i;
+
+	if (!path)
+		return;
+
+	eh = path[depth].p_hdr;
+	ex = EXT_FIRST_EXTENT(eh);
+
+	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
+		ext_debug("%d:%d:%d ", le32_to_cpu(ex->ee_block),
+			  le16_to_cpu(ex->ee_len),
+			  le32_to_cpu(ex->ee_start));
+	}
+	ext_debug("\n");
+}
+#else
+#define ext3_ext_show_path(inode,path)
+#define ext3_ext_show_leaf(inode,path)
+#endif
+
+static void ext3_ext_drop_refs(struct ext3_ext_path *path)
+{
+	int depth = path->p_depth;
+	int i;
+
+	for (i = 0; i <= depth; i++, path++)
+		if (path->p_bh) {
+			brelse(path->p_bh);
+			path->p_bh = NULL;
+		}
+}
+
+/*
+ * binary search for closest index by given block
+ */
+static void
+ext3_ext_binsearch_idx(struct inode *inode, struct ext3_ext_path *path, int block)
+{
+	struct ext3_extent_header *eh = path->p_hdr;
+	struct ext3_extent_idx *r, *l, *m;
+
+	BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+	BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+	BUG_ON(le16_to_cpu(eh->eh_entries) <= 0);
+
+	ext_debug("binsearch for %d(idx):  ", block);
+
+	l = EXT_FIRST_INDEX(eh) + 1;
+	r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh->eh_entries) - 1;
+	while (l <= r) {
+		m = l + (r - l) / 2;
+		if (block < le32_to_cpu(m->ei_block))
+			r = m - 1;
+		else
+			l = m + 1;
+		ext_debug("%p(%u):%p(%u):%p(%u) ", l, l->ei_block,
+				m, m->ei_block, r, r->ei_block);
+	}
+
+	path->p_idx = l - 1;
+	ext_debug("  -> %d->%d ", le32_to_cpu(path->p_idx->ei_block),
+		  le32_to_cpu(path->p_idx->ei_leaf));
+
+#ifdef CHECK_BINSEARCH
+	{
+		struct ext3_extent_idx *chix, *ix;
+		int k;
+
+		chix = ix = EXT_FIRST_INDEX(eh);
+		for (k = 0; k < le16_to_cpu(eh->eh_entries); k++, ix++) {
+		  if (k != 0 &&
+		      le32_to_cpu(ix->ei_block) <= le32_to_cpu(ix[-1].ei_block)) {
+				printk("k=%d, ix=0x%p, first=0x%p\n", k,
+					ix, EXT_FIRST_INDEX(eh));
+				printk("%u <= %u\n",
+				       le32_to_cpu(ix->ei_block),
+				       le32_to_cpu(ix[-1].ei_block));
+			}
+			BUG_ON(k && le32_to_cpu(ix->ei_block)
+				           <= le32_to_cpu(ix[-1].ei_block));
+			if (block < le32_to_cpu(ix->ei_block))
+				break;
+			chix = ix;
+		}
+		BUG_ON(chix != path->p_idx);
+	}
+#endif
+
+}
+
+/*
+ * binary search for closest extent by given block
+ */
+static void
+ext3_ext_binsearch(struct inode *inode, struct ext3_ext_path *path, int block)
+{
+	struct ext3_extent_header *eh = path->p_hdr;
+	struct ext3_extent *r, *l, *m;
+
+	BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+	BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+
+	if (eh->eh_entries == 0) {
+		/*
+		 * this leaf is empty yet:
+		 *  we get such a leaf in split/add case
+		 */
+		return;
+	}
+
+	ext_debug("binsearch for %d:  ", block);
+
+	l = EXT_FIRST_EXTENT(eh) + 1;
+	r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh->eh_entries) - 1;
+
+	while (l <= r) {
+		m = l + (r - l) / 2;
+		if (block < le32_to_cpu(m->ee_block))
+			r = m - 1;
+		else
+			l = m + 1;
+		ext_debug("%p(%u):%p(%u):%p(%u) ", l, l->ee_block,
+				m, m->ee_block, r, r->ee_block);
+	}
+
+	path->p_ext = l - 1;
+	ext_debug("  -> %d:%d:%d ",
+		        le32_to_cpu(path->p_ext->ee_block),
+		        le32_to_cpu(path->p_ext->ee_start),
+		        le16_to_cpu(path->p_ext->ee_len));
+
+#ifdef CHECK_BINSEARCH
+	{
+		struct ext3_extent *chex, *ex;
+		int k;
+
+		chex = ex = EXT_FIRST_EXTENT(eh);
+		for (k = 0; k < le16_to_cpu(eh->eh_entries); k++, ex++) {
+			BUG_ON(k && le32_to_cpu(ex->ee_block)
+				          <= le32_to_cpu(ex[-1].ee_block));
+			if (block < le32_to_cpu(ex->ee_block))
+				break;
+			chex = ex;
+		}
+		BUG_ON(chex != path->p_ext);
+	}
+#endif
+
+}
+
+int ext3_ext_tree_init(handle_t *handle, struct inode *inode)
+{
+	struct ext3_extent_header *eh;
+
+	eh = ext_inode_hdr(inode);
+	eh->eh_depth = 0;
+	eh->eh_entries = 0;
+	eh->eh_magic = EXT3_EXT_MAGIC;
+	eh->eh_max = cpu_to_le16(ext3_ext_space_root(inode));
+	ext3_mark_inode_dirty(handle, inode);
+	ext3_ext_invalidate_cache(inode);
+	return 0;
+}
+
+struct ext3_ext_path *
+ext3_ext_find_extent(struct inode *inode, int block, struct ext3_ext_path *path)
+{
+	struct ext3_extent_header *eh;
+	struct buffer_head *bh;
+	short int depth, i, ppos = 0, alloc = 0;
+
+	eh = ext_inode_hdr(inode);
+	BUG_ON(eh == NULL);
+	if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+		return ERR_PTR(-EIO);
+
+	i = depth = ext_depth(inode);
+
+	/* account possible depth increase */
+	if (!path) {
+		path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 2),
+				GFP_NOFS);
+		if (!path)
+			return ERR_PTR(-ENOMEM);
+		alloc = 1;
+	}
+	memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1));
+	path[0].p_hdr = eh;
+
+	/* walk through the tree */
+	while (i) {
+		ext_debug("depth %d: num %d, max %d\n",
+			  ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
+		ext3_ext_binsearch_idx(inode, path + ppos, block);
+		path[ppos].p_block = le32_to_cpu(path[ppos].p_idx->ei_leaf);
+		path[ppos].p_depth = i;
+		path[ppos].p_ext = NULL;
+
+		bh = sb_bread(inode->i_sb, path[ppos].p_block);
+		if (!bh)
+			goto err;
+
+		eh = ext_block_hdr(bh);
+		ppos++;
+		BUG_ON(ppos > depth);
+		path[ppos].p_bh = bh;
+		path[ppos].p_hdr = eh;
+		i--;
+
+		if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+			goto err;
+	}
+
+	path[ppos].p_depth = i;
+	path[ppos].p_hdr = eh;
+	path[ppos].p_ext = NULL;
+	path[ppos].p_idx = NULL;
+
+	if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+		goto err;
+
+	/* find extent */
+	ext3_ext_binsearch(inode, path + ppos, block);
+
+	ext3_ext_show_path(inode, path);
+
+	return path;
+
+err:
+	ext3_ext_drop_refs(path);
+	if (alloc)
+		kfree(path);
+	return ERR_PTR(-EIO);
+}
+
+/*
+ * insert new index [logical;ptr] into the block at curp
+ * it checks where to insert: before curp or after curp
+ */
+static int ext3_ext_insert_index(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *curp,
+				int logical, int ptr)
+{
+	struct ext3_extent_idx *ix;
+	int len, err;
+
+	if ((err = ext3_ext_get_access(handle, inode, curp)))
+		return err;
+
+	BUG_ON(logical == le32_to_cpu(curp->p_idx->ei_block));
+	len = EXT_MAX_INDEX(curp->p_hdr) - curp->p_idx;
+	if (logical > le32_to_cpu(curp->p_idx->ei_block)) {
+		/* insert after */
+		if (curp->p_idx != EXT_LAST_INDEX(curp->p_hdr)) {
+			len = (len - 1) * sizeof(struct ext3_extent_idx);
+			len = len < 0 ? 0 : len;
+			ext_debug("insert new index %d after: %d. "
+					"move %d from 0x%p to 0x%p\n",
+					logical, ptr, len,
+					(curp->p_idx + 1), (curp->p_idx + 2));
+			memmove(curp->p_idx + 2, curp->p_idx + 1, len);
+		}
+		ix = curp->p_idx + 1;
+	} else {
+		/* insert before */
+		len = len * sizeof(struct ext3_extent_idx);
+		len = len < 0 ? 0 : len;
+		ext_debug("insert new index %d before: %d. "
+				"move %d from 0x%p to 0x%p\n",
+				logical, ptr, len,
+				curp->p_idx, (curp->p_idx + 1));
+		memmove(curp->p_idx + 1, curp->p_idx, len);
+		ix = curp->p_idx;
+	}
+
+	ix->ei_block = cpu_to_le32(logical);
+	ix->ei_leaf = cpu_to_le32(ptr);
+	curp->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(curp->p_hdr->eh_entries)+1);
+
+	BUG_ON(le16_to_cpu(curp->p_hdr->eh_entries)
+	                     > le16_to_cpu(curp->p_hdr->eh_max));
+	BUG_ON(ix > EXT_LAST_INDEX(curp->p_hdr));
+
+	err = ext3_ext_dirty(handle, inode, curp);
+	ext3_std_error(inode->i_sb, err);
+
+	return err;
+}
+
+/*
+ * routine inserts new subtree into the path, using free index entry
+ * at depth 'at':
+ *  - allocates all needed blocks (new leaf and all intermediate index blocks)
+ *  - makes decision where to split
+ *  - moves remaining extents and index entries (right to the split point)
+ *    into the newly allocated blocks
+ *  - initializes subtree
+ */
+static int ext3_ext_split(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *path,
+				struct ext3_extent *newext, int at)
+{
+	struct buffer_head *bh = NULL;
+	int depth = ext_depth(inode);
+	struct ext3_extent_header *neh;
+	struct ext3_extent_idx *fidx;
+	struct ext3_extent *ex;
+	int i = at, k, m, a;
+	unsigned long newblock, oldblock;
+	__le32 border;
+	unsigned long *ablocks = NULL; /* array of allocated blocks */
+	int err = 0;
+
+	/* make decision: where to split? */
+	/* FIXME: for now the decision is the simplest one: at current extent */
+
+	/* if current leaf will be split, then we should use
+	 * border from split point */
+	BUG_ON(path[depth].p_ext > EXT_MAX_EXTENT(path[depth].p_hdr));
+	if (path[depth].p_ext != EXT_MAX_EXTENT(path[depth].p_hdr)) {
+		border = path[depth].p_ext[1].ee_block;
+		ext_debug("leaf will be split."
+				" next leaf starts at %d\n",
+			          le32_to_cpu(border));
+	} else {
+		border = newext->ee_block;
+		ext_debug("leaf will be added."
+				" next leaf starts at %d\n",
+			        le32_to_cpu(border));
+	}
+
+	/*
+	 * if error occurs, then we break processing
+	 * and turn filesystem read-only. so, index won't
+	 * be inserted and tree will be in consistent
+	 * state. next mount will repair buffers too
+	 */
+
+	/*
+	 * get array to track all allocated blocks
+	 * we need this to handle errors and free blocks
+	 * upon them
+	 */
+	ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS);
+	if (!ablocks)
+		return -ENOMEM;
+	memset(ablocks, 0, sizeof(unsigned long) * depth);
+
+	/* allocate all needed blocks */
+	ext_debug("allocate %d blocks for indexes/leaf\n", depth - at);
+	for (a = 0; a < depth - at; a++) {
+		newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
+		if (newblock == 0)
+			goto cleanup;
+		ablocks[a] = newblock;
+	}
+
+	/* initialize new leaf */
+	newblock = ablocks[--a];
+	BUG_ON(newblock == 0);
+	bh = sb_getblk(inode->i_sb, newblock);
+	if (!bh) {
+		err = -EIO;
+		goto cleanup;
+	}
+	lock_buffer(bh);
+
+	if ((err = ext3_journal_get_create_access(handle, bh)))
+		goto cleanup;
+
+	neh = ext_block_hdr(bh);
+	neh->eh_entries = 0;
+	neh->eh_max = cpu_to_le16(ext3_ext_space_block(inode));
+	neh->eh_magic = EXT3_EXT_MAGIC;
+	neh->eh_depth = 0;
+	ex = EXT_FIRST_EXTENT(neh);
+
+	/* move the remainder of path[depth] to the new leaf */
+	BUG_ON(path[depth].p_hdr->eh_entries != path[depth].p_hdr->eh_max);
+	/* start copy from next extent */
+	/* TODO: we could do it by single memmove */
+	m = 0;
+	path[depth].p_ext++;
+	while (path[depth].p_ext <=
+			EXT_MAX_EXTENT(path[depth].p_hdr)) {
+		ext_debug("move %d:%d:%d in new leaf %lu\n",
+			        le32_to_cpu(path[depth].p_ext->ee_block),
+			        le32_to_cpu(path[depth].p_ext->ee_start),
+			        le16_to_cpu(path[depth].p_ext->ee_len),
+				newblock);
+		/*memmove(ex++, path[depth].p_ext++,
+				sizeof(struct ext3_extent));
+		neh->eh_entries++;*/
+		path[depth].p_ext++;
+		m++;
+	}
+	if (m) {
+		memmove(ex, path[depth].p_ext-m, sizeof(struct ext3_extent)*m);
+		neh->eh_entries = cpu_to_le16(le16_to_cpu(neh->eh_entries)+m);
+	}
+
+	set_buffer_uptodate(bh);
+	unlock_buffer(bh);
+
+	if ((err = ext3_journal_dirty_metadata(handle, bh)))
+		goto cleanup;
+	brelse(bh);
+	bh = NULL;
+
+	/* correct old leaf */
+	if (m) {
+		if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+			goto cleanup;
+		path[depth].p_hdr->eh_entries =
+		     cpu_to_le16(le16_to_cpu(path[depth].p_hdr->eh_entries)-m);
+		if ((err = ext3_ext_dirty(handle, inode, path + depth)))
+			goto cleanup;
+
+	}
+
+	/* create intermediate indexes */
+	k = depth - at - 1;
+	BUG_ON(k < 0);
+	if (k)
+		ext_debug("create %d intermediate indices\n", k);
+	/* insert new index into current index block */
+	/* current depth stored in i var */
+	i = depth - 1;
+	while (k--) {
+		oldblock = newblock;
+		newblock = ablocks[--a];
+		bh = sb_getblk(inode->i_sb, newblock);
+		if (!bh) {
+			err = -EIO;
+			goto cleanup;
+		}
+		lock_buffer(bh);
+
+		if ((err = ext3_journal_get_create_access(handle, bh)))
+			goto cleanup;
+
+		neh = ext_block_hdr(bh);
+		neh->eh_entries = cpu_to_le16(1);
+		neh->eh_magic = EXT3_EXT_MAGIC;
+		neh->eh_max = cpu_to_le16(ext3_ext_space_block_idx(inode));
+		neh->eh_depth = cpu_to_le16(depth - i);
+		fidx = EXT_FIRST_INDEX(neh);
+		fidx->ei_block = border;
+		fidx->ei_leaf = cpu_to_le32(oldblock);
+
+		ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
+				newblock, (unsigned long) le32_to_cpu(border),
+			  	oldblock);
+		/* copy indexes */
+		m = 0;
+		path[i].p_idx++;
+
+		ext_debug("cur 0x%p, last 0x%p\n", path[i].p_idx,
+				EXT_MAX_INDEX(path[i].p_hdr));
+		BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
+				EXT_LAST_INDEX(path[i].p_hdr));
+		while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
+			ext_debug("%d: move %d:%d in new index %lu\n", i,
+				        le32_to_cpu(path[i].p_idx->ei_block),
+				        le32_to_cpu(path[i].p_idx->ei_leaf),
+				        newblock);
+			/*memmove(++fidx, path[i].p_idx++,
+					sizeof(struct ext3_extent_idx));
+			neh->eh_entries++;
+			BUG_ON(neh->eh_entries > neh->eh_max);*/
+			path[i].p_idx++;
+			m++;
+		}
+		if (m) {
+			memmove(++fidx, path[i].p_idx - m,
+				sizeof(struct ext3_extent_idx) * m);
+			neh->eh_entries =
+				cpu_to_le16(le16_to_cpu(neh->eh_entries) + m);
+		}
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+
+		if ((err = ext3_journal_dirty_metadata(handle, bh)))
+			goto cleanup;
+		brelse(bh);
+		bh = NULL;
+
+		/* correct old index */
+		if (m) {
+			err = ext3_ext_get_access(handle, inode, path + i);
+			if (err)
+				goto cleanup;
+			path[i].p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path[i].p_hdr->eh_entries)-m);
+			err = ext3_ext_dirty(handle, inode, path + i);
+			if (err)
+				goto cleanup;
+		}
+
+		i--;
+	}
+
+	/* insert new index */
+	if (err)
+		goto cleanup;
+
+	err = ext3_ext_insert_index(handle, inode, path + at,
+				    le32_to_cpu(border), newblock);
+
+cleanup:
+	if (bh) {
+		if (buffer_locked(bh))
+			unlock_buffer(bh);
+		brelse(bh);
+	}
+
+	if (err) {
+		/* free all allocated blocks in error case */
+		for (i = 0; i < depth; i++) {
+			if (!ablocks[i])
+				continue;
+			ext3_free_blocks(handle, inode, ablocks[i], 1);
+		}
+	}
+	kfree(ablocks);
+
+	return err;
+}
+
+/*
+ * routine implements tree growing procedure:
+ *  - allocates new block
+ *  - moves top-level data (index block or leaf) into the new block
+ *  - initializes new top-level, creating index that points to the
+ *    just created block
+ */
+static int ext3_ext_grow_indepth(handle_t *handle, struct inode *inode,
+					struct ext3_ext_path *path,
+					struct ext3_extent *newext)
+{
+	struct ext3_ext_path *curp = path;
+	struct ext3_extent_header *neh;
+	struct ext3_extent_idx *fidx;
+	struct buffer_head *bh;
+	unsigned long newblock;
+	int err = 0;
+
+	newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
+	if (newblock == 0)
+		return err;
+
+	bh = sb_getblk(inode->i_sb, newblock);
+	if (!bh) {
+		err = -EIO;
+		ext3_std_error(inode->i_sb, err);
+		return err;
+	}
+	lock_buffer(bh);
+
+	if ((err = ext3_journal_get_create_access(handle, bh))) {
+		unlock_buffer(bh);
+		goto out;
+	}
+
+	/* move top-level index/leaf into new block */
+	memmove(bh->b_data, curp->p_hdr, sizeof(EXT3_I(inode)->i_data));
+
+	/* set size of new block */
+	neh = ext_block_hdr(bh);
+	/* old root could have indexes or leaves,
+	 * so calculate eh_max the right way */
+	if (ext_depth(inode))
+		neh->eh_max = cpu_to_le16(ext3_ext_space_block_idx(inode));
+	else
+		neh->eh_max = cpu_to_le16(ext3_ext_space_block(inode));
+	neh->eh_magic = EXT3_EXT_MAGIC;
+	set_buffer_uptodate(bh);
+	unlock_buffer(bh);
+
+	if ((err = ext3_journal_dirty_metadata(handle, bh)))
+		goto out;
+
+	/* create index in new top-level index: num,max,pointer */
+	if ((err = ext3_ext_get_access(handle, inode, curp)))
+		goto out;
+
+	curp->p_hdr->eh_magic = EXT3_EXT_MAGIC;
+	curp->p_hdr->eh_max = cpu_to_le16(ext3_ext_space_root_idx(inode));
+	curp->p_hdr->eh_entries = cpu_to_le16(1);
+	curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
+	/* FIXME: it works, but actually path[0] can be index */
+	curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
+	curp->p_idx->ei_leaf = cpu_to_le32(newblock);
+
+	neh = ext_inode_hdr(inode);
+	fidx = EXT_FIRST_INDEX(neh);
+	ext_debug("new root: num %d(%d), lblock %d, ptr %d\n",
+		  le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
+		  le32_to_cpu(fidx->ei_block), le32_to_cpu(fidx->ei_leaf));
+
+	neh->eh_depth = cpu_to_le16(path->p_depth + 1);
+	err = ext3_ext_dirty(handle, inode, curp);
+out:
+	brelse(bh);
+
+	return err;
+}
+
+/*
+ * routine finds empty index and adds new leaf. if no free index found
+ * then it requests in-depth growing
+ */
+static int ext3_ext_create_new_leaf(handle_t *handle, struct inode *inode,
+					struct ext3_ext_path *path,
+					struct ext3_extent *newext)
+{
+	struct ext3_ext_path *curp;
+	int depth, i, err = 0;
+
+repeat:
+	i = depth = ext_depth(inode);
+
+	/* walk up the tree and look for free index entry */
+	curp = path + depth;
+	while (i > 0 && !EXT_HAS_FREE_INDEX(curp)) {
+		i--;
+		curp--;
+	}
+
+	/* we use already allocated block for index block
+	 * so, subsequent data blocks should be contiguous */
+	if (EXT_HAS_FREE_INDEX(curp)) {
+		/* if we found index with free entry, then use that
+		 * entry: create all needed subtree and add new leaf */
+		err = ext3_ext_split(handle, inode, path, newext, i);
+
+		/* refill path */
+		ext3_ext_drop_refs(path);
+		path = ext3_ext_find_extent(inode,
+					    le32_to_cpu(newext->ee_block),
+					    path);
+		if (IS_ERR(path))
+			err = PTR_ERR(path);
+	} else {
+		/* tree is full, time to grow in depth */
+		err = ext3_ext_grow_indepth(handle, inode, path, newext);
+		if (err)
+			goto out;
+
+		/* refill path */
+		ext3_ext_drop_refs(path);
+		path = ext3_ext_find_extent(inode,
+					    le32_to_cpu(newext->ee_block),
+					    path);
+		if (IS_ERR(path)) {
+			err = PTR_ERR(path);
+			goto out;
+		}
+
+		/*
+		 * only first (depth 0 -> 1) produces free space
+		 * in all other cases we have to split the grown tree
+		 */
+		depth = ext_depth(inode);
+		if (path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max) {
+			/* now we need split */
+			goto repeat;
+		}
+	}
+
+out:
+	return err;
+}
+
+/*
+ * returns allocated block in subsequent extent or EXT_MAX_BLOCK
+ * NOTE: it considers block number from index entry as
+ * allocated block. thus, index entries have to be consistent
+ * with leaves
+ */
+static unsigned long
+ext3_ext_next_allocated_block(struct ext3_ext_path *path)
+{
+	int depth;
+
+	BUG_ON(path == NULL);
+	depth = path->p_depth;
+
+	if (depth == 0 && path->p_ext == NULL)
+		return EXT_MAX_BLOCK;
+
+	while (depth >= 0) {
+		if (depth == path->p_depth) {
+			/* leaf */
+			if (path[depth].p_ext !=
+					EXT_LAST_EXTENT(path[depth].p_hdr))
+				return le32_to_cpu(path[depth].p_ext[1].ee_block);
+		} else {
+			/* index */
+			if (path[depth].p_idx !=
+					EXT_LAST_INDEX(path[depth].p_hdr))
+				return le32_to_cpu(path[depth].p_idx[1].ei_block);
+		}
+		depth--;
+	}
+
+	return EXT_MAX_BLOCK;
+}
+
+/*
+ * returns first allocated block from next leaf or EXT_MAX_BLOCK
+ */
+static unsigned ext3_ext_next_leaf_block(struct inode *inode,
+                                               struct ext3_ext_path *path)
+{
+	int depth;
+
+	BUG_ON(path == NULL);
+	depth = path->p_depth;
+
+	/* zero-tree has no leaf blocks at all */
+	if (depth == 0)
+		return EXT_MAX_BLOCK;
+
+	/* go to index block */
+	depth--;
+
+	while (depth >= 0) {
+		if (path[depth].p_idx !=
+				EXT_LAST_INDEX(path[depth].p_hdr))
+			return le32_to_cpu(path[depth].p_idx[1].ei_block);
+		depth--;
+	}
+
+	return EXT_MAX_BLOCK;
+}
+
+/*
+ * if leaf gets modified and modified extent is first in the leaf
+ * then we have to correct all indexes above
+ * TODO: do we need to correct tree in all cases?
+ */
+int ext3_ext_correct_indexes(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *path)
+{
+	struct ext3_extent_header *eh;
+	int depth = ext_depth(inode);
+	struct ext3_extent *ex;
+	__le32 border;
+	int k, err = 0;
+
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	BUG_ON(ex == NULL);
+	BUG_ON(eh == NULL);
+
+	if (depth == 0) {
+		/* there is no tree at all */
+		return 0;
+	}
+
+	if (ex != EXT_FIRST_EXTENT(eh)) {
+		/* correct tree only if first extent in leaf was modified */
+		return 0;
+	}
+
+	/*
+	 * TODO: we need correction if border is smaller than current one
+	 */
+	k = depth - 1;
+	border = path[depth].p_ext->ee_block;
+	if ((err = ext3_ext_get_access(handle, inode, path + k)))
+		return err;
+	path[k].p_idx->ei_block = border;
+	if ((err = ext3_ext_dirty(handle, inode, path + k)))
+		return err;
+
+	while (k--) {
+		/* change all left-side indexes */
+		if (path[k+1].p_idx != EXT_FIRST_INDEX(path[k+1].p_hdr))
+			break;
+		if ((err = ext3_ext_get_access(handle, inode, path + k)))
+			break;
+		path[k].p_idx->ei_block = border;
+		if ((err = ext3_ext_dirty(handle, inode, path + k)))
+			break;
+	}
+
+	return err;
+}
+
+static inline int
+ext3_can_extents_be_merged(struct inode *inode, struct ext3_extent *ex1,
+				struct ext3_extent *ex2)
+{
+	/* FIXME: 48bit support */
+	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len)
+	    != le32_to_cpu(ex2->ee_block))
+		return 0;
+
+#ifdef AGRESSIVE_TEST
+	if (le16_to_cpu(ex1->ee_len) >= 4)
+		return 0;
+#endif
+
+	if (le32_to_cpu(ex1->ee_start) + le16_to_cpu(ex1->ee_len)
+			== le32_to_cpu(ex2->ee_start))
+		return 1;
+	return 0;
+}
+
+/*
+ * this routine tries to merge requested extent into the existing
+ * extent or inserts requested extent as new one into the tree,
+ * creating new leaf in no-space case
+ */
+int ext3_ext_insert_extent(handle_t *handle, struct inode *inode,
+				struct ext3_ext_path *path,
+				struct ext3_extent *newext)
+{
+	struct ext3_extent_header * eh;
+	struct ext3_extent *ex, *fex;
+	struct ext3_extent *nearex; /* nearest extent */
+	struct ext3_ext_path *npath = NULL;
+	int depth, len, err, next;
+
+	BUG_ON(newext->ee_len == 0);
+	depth = ext_depth(inode);
+	ex = path[depth].p_ext;
+	BUG_ON(path[depth].p_hdr == NULL);
+
+	/* try to insert block into found extent and return */
+	if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
+		ext_debug("append %d block to %d:%d (from %d)\n",
+				le16_to_cpu(newext->ee_len),
+				le32_to_cpu(ex->ee_block),
+				le16_to_cpu(ex->ee_len),
+				le32_to_cpu(ex->ee_start));
+		if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+			return err;
+		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
+					 + le16_to_cpu(newext->ee_len));
+		eh = path[depth].p_hdr;
+		nearex = ex;
+		goto merge;
+	}
+
+repeat:
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max))
+		goto has_space;
+
+	/* probably next leaf has space for us? */
+	fex = EXT_LAST_EXTENT(eh);
+	next = ext3_ext_next_leaf_block(inode, path);
+	if (le32_to_cpu(newext->ee_block) > le32_to_cpu(fex->ee_block)
+	    && next != EXT_MAX_BLOCK) {
+		ext_debug("next leaf block - %d\n", next);
+		BUG_ON(npath != NULL);
+		npath = ext3_ext_find_extent(inode, next, NULL);
+		if (IS_ERR(npath))
+			return PTR_ERR(npath);
+		BUG_ON(npath->p_depth != path->p_depth);
+		eh = npath[depth].p_hdr;
+		if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max)) {
+			ext_debug("next leaf isnt full(%d)\n",
+				  le16_to_cpu(eh->eh_entries));
+			path = npath;
+			goto repeat;
+		}
+		ext_debug("next leaf has no free space(%d,%d)\n",
+			  le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
+	}
+
+	/*
+	 * there is no free space in found leaf
+	 * we're gonna add new leaf in the tree
+	 */
+	err = ext3_ext_create_new_leaf(handle, inode, path, newext);
+	if (err)
+		goto cleanup;
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+
+has_space:
+	nearex = path[depth].p_ext;
+
+	if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+		goto cleanup;
+
+	if (!nearex) {
+		/* there is no extent in this leaf, create first one */
+		ext_debug("first extent in the leaf: %d:%d:%d\n",
+			        le32_to_cpu(newext->ee_block),
+			        le32_to_cpu(newext->ee_start),
+			        le16_to_cpu(newext->ee_len));
+		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
+	} else if (le32_to_cpu(newext->ee_block)
+		           > le32_to_cpu(nearex->ee_block)) {
+/* 		BUG_ON(newext->ee_block == nearex->ee_block); */
+		if (nearex != EXT_LAST_EXTENT(eh)) {
+			len = EXT_MAX_EXTENT(eh) - nearex;
+			len = (len - 1) * sizeof(struct ext3_extent);
+			len = len < 0 ? 0 : len;
+			ext_debug("insert %d:%d:%d after: nearest 0x%p, "
+					"move %d from 0x%p to 0x%p\n",
+				        le32_to_cpu(newext->ee_block),
+				        le32_to_cpu(newext->ee_start),
+				        le16_to_cpu(newext->ee_len),
+					nearex, len, nearex + 1, nearex + 2);
+			memmove(nearex + 2, nearex + 1, len);
+		}
+		path[depth].p_ext = nearex + 1;
+	} else {
+ 		BUG_ON(newext->ee_block == nearex->ee_block);
+		len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
+		len = len < 0 ? 0 : len;
+		ext_debug("insert %d:%d:%d before: nearest 0x%p, "
+				"move %d from 0x%p to 0x%p\n",
+				le32_to_cpu(newext->ee_block),
+				le32_to_cpu(newext->ee_start),
+				le16_to_cpu(newext->ee_len),
+				nearex, len, nearex, nearex + 1);
+		memmove(nearex + 1, nearex, len);
+		path[depth].p_ext = nearex;
+	}
+
+	eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)+1);
+	nearex = path[depth].p_ext;
+	nearex->ee_block = newext->ee_block;
+	nearex->ee_start = newext->ee_start;
+	nearex->ee_len = newext->ee_len;
+	/* FIXME: support for large fs */
+	nearex->ee_start_hi = 0;
+
+merge:
+	/* try to merge extents to the right */
+	while (nearex < EXT_LAST_EXTENT(eh)) {
+		if (!ext3_can_extents_be_merged(inode, nearex, nearex + 1))
+			break;
+		/* merge with next extent! */
+		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
+					     + le16_to_cpu(nearex[1].ee_len));
+		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
+					* sizeof(struct ext3_extent);
+			memmove(nearex + 1, nearex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+		BUG_ON(eh->eh_entries == 0);
+	}
+
+	/* TODO: try to merge extents to the left */
+
+	/* time to correct all indexes above */
+	err = ext3_ext_correct_indexes(handle, inode, path);
+	if (err)
+		goto cleanup;
+
+	err = ext3_ext_dirty(handle, inode, path + depth);
+
+cleanup:
+	if (npath) {
+		ext3_ext_drop_refs(npath);
+		kfree(npath);
+	}
+	ext3_ext_tree_changed(inode);
+	ext3_ext_invalidate_cache(inode);
+	return err;
+}
+
+int ext3_ext_walk_space(struct inode *inode, unsigned long block,
+			unsigned long num, ext_prepare_callback func,
+			void *cbdata)
+{
+	struct ext3_ext_path *path = NULL;
+	struct ext3_ext_cache cbex;
+	struct ext3_extent *ex;
+	unsigned long next, start = 0, end = 0;
+	unsigned long last = block + num;
+	int depth, exists, err = 0;
+
+	BUG_ON(func == NULL);
+	BUG_ON(inode == NULL);
+
+	while (block < last && block != EXT_MAX_BLOCK) {
+		num = last - block;
+		/* find extent for this block */
+		path = ext3_ext_find_extent(inode, block, path);
+		if (IS_ERR(path)) {
+			err = PTR_ERR(path);
+			path = NULL;
+			break;
+		}
+
+		depth = ext_depth(inode);
+		BUG_ON(path[depth].p_hdr == NULL);
+		ex = path[depth].p_ext;
+		next = ext3_ext_next_allocated_block(path);
+
+		exists = 0;
+		if (!ex) {
+			/* there is no extent yet, so try to allocate
+			 * all requested space */
+			start = block;
+			end = block + num;
+		} else if (le32_to_cpu(ex->ee_block) > block) {
+			/* need to allocate space before found extent */
+			start = block;
+			end = le32_to_cpu(ex->ee_block);
+			if (block + num < end)
+				end = block + num;
+		} else if (block >=
+			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+			/* need to allocate space after found extent */
+			start = block;
+			end = block + num;
+			if (end >= next)
+				end = next;
+		} else if (block >= le32_to_cpu(ex->ee_block)) {
+			/*
+			 * some part of requested space is covered
+			 * by found extent
+			 */
+			start = block;
+			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			if (block + num < end)
+				end = block + num;
+			exists = 1;
+		} else {
+			BUG();
+		}
+		BUG_ON(end <= start);
+
+		if (!exists) {
+			cbex.ec_block = start;
+			cbex.ec_len = end - start;
+			cbex.ec_start = 0;
+			cbex.ec_type = EXT3_EXT_CACHE_GAP;
+		} else {
+			cbex.ec_block = le32_to_cpu(ex->ee_block);
+			cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_start = le32_to_cpu(ex->ee_start);
+			cbex.ec_type = EXT3_EXT_CACHE_EXTENT;
+		}
+
+		BUG_ON(cbex.ec_len == 0);
+		err = func(inode, path, &cbex, cbdata);
+		ext3_ext_drop_refs(path);
+
+		if (err < 0)
+			break;
+		if (err == EXT_REPEAT)
+			continue;
+		else if (err == EXT_BREAK) {
+			err = 0;
+			break;
+		}
+
+		if (ext_depth(inode) != depth) {
+			/* depth was changed. we have to realloc path */
+			kfree(path);
+			path = NULL;
+		}
+
+		block = cbex.ec_block + cbex.ec_len;
+	}
+
+	if (path) {
+		ext3_ext_drop_refs(path);
+		kfree(path);
+	}
+
+	return err;
+}
+
+static inline void
+ext3_ext_put_in_cache(struct inode *inode, __u32 block,
+			__u32 len, __u32 start, int type)
+{
+	struct ext3_ext_cache *cex;
+	BUG_ON(len == 0);
+	cex = &EXT3_I(inode)->i_cached_extent;
+	cex->ec_type = type;
+	cex->ec_block = block;
+	cex->ec_len = len;
+	cex->ec_start = start;
+}
+
+/*
+ * this routine calculates boundaries of the gap the requested block
+ * fits into and caches this gap
+ */
+static inline void
+ext3_ext_put_gap_in_cache(struct inode *inode, struct ext3_ext_path *path,
+				unsigned long block)
+{
+	int depth = ext_depth(inode);
+	unsigned long lblock, len;
+	struct ext3_extent *ex;
+
+	ex = path[depth].p_ext;
+	if (ex == NULL) {
+		/* there is no extent yet, so gap is [0;-] */
+		lblock = 0;
+		len = EXT_MAX_BLOCK;
+		ext_debug("cache gap(whole file):");
+	} else if (block < le32_to_cpu(ex->ee_block)) {
+		lblock = block;
+		len = le32_to_cpu(ex->ee_block) - block;
+		ext_debug("cache gap(before): %lu [%lu:%lu]",
+				(unsigned long) block,
+			        (unsigned long) le32_to_cpu(ex->ee_block),
+			        (unsigned long) le16_to_cpu(ex->ee_len));
+	} else if (block >= le32_to_cpu(ex->ee_block)
+			    + le16_to_cpu(ex->ee_len)) {
+		lblock = le32_to_cpu(ex->ee_block)
+			 + le16_to_cpu(ex->ee_len);
+		len = ext3_ext_next_allocated_block(path);
+		ext_debug("cache gap(after): [%lu:%lu] %lu",
+			        (unsigned long) le32_to_cpu(ex->ee_block),
+			        (unsigned long) le16_to_cpu(ex->ee_len),
+				(unsigned long) block);
+		BUG_ON(len == lblock);
+		len = len - lblock;
+	} else {
+		lblock = len = 0;
+		BUG();
+	}
+
+	ext_debug(" -> %lu:%lu\n", (unsigned long) lblock, len);
+	ext3_ext_put_in_cache(inode, lblock, len, 0, EXT3_EXT_CACHE_GAP);
+}
+
+static inline int
+ext3_ext_in_cache(struct inode *inode, unsigned long block,
+			struct ext3_extent *ex)
+{
+	struct ext3_ext_cache *cex;
+
+	cex = &EXT3_I(inode)->i_cached_extent;
+
+	/* does the cache hold valid data? */
+	if (cex->ec_type == EXT3_EXT_CACHE_NO)
+		return EXT3_EXT_CACHE_NO;
+
+	BUG_ON(cex->ec_type != EXT3_EXT_CACHE_GAP &&
+			cex->ec_type != EXT3_EXT_CACHE_EXTENT);
+	if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) {
+		ex->ee_block = cpu_to_le32(cex->ec_block);
+		ex->ee_start = cpu_to_le32(cex->ec_start);
+		ex->ee_len = cpu_to_le16(cex->ec_len);
+		ext_debug("%lu cached by %lu:%lu:%lu\n",
+				(unsigned long) block,
+				(unsigned long) cex->ec_block,
+				(unsigned long) cex->ec_len,
+				(unsigned long) cex->ec_start);
+		return cex->ec_type;
+	}
+
+	/* not in cache */
+	return EXT3_EXT_CACHE_NO;
+}
+
+/*
+ * routine removes index from the index block
+ * it's used in truncate case only. thus all requests are for
+ * last index in the block only
+ */
+int ext3_ext_rm_idx(handle_t *handle, struct inode *inode,
+			struct ext3_ext_path *path)
+{
+	struct buffer_head *bh;
+	int err;
+	unsigned long leaf;
+
+	/* free index block */
+	path--;
+	leaf = le32_to_cpu(path->p_idx->ei_leaf);
+	BUG_ON(path->p_hdr->eh_entries == 0);
+	if ((err = ext3_ext_get_access(handle, inode, path)))
+		return err;
+	path->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path->p_hdr->eh_entries)-1);
+	if ((err = ext3_ext_dirty(handle, inode, path)))
+		return err;
+	ext_debug("index is empty, remove it, free block %lu\n", leaf);
+	bh = sb_find_get_block(inode->i_sb, leaf);
+	ext3_forget(handle, 1, inode, bh, leaf);
+	ext3_free_blocks(handle, inode, leaf, 1);
+	return err;
+}
+
+/*
+ * This routine returns max. credits extent tree can consume.
+ * It should be OK for low-performance paths like ->writepage().
+ * To allow many writing processes to fit a single transaction,
+ * caller should calculate credits under truncate_mutex and
+ * pass actual path.
+ */
+inline int ext3_ext_calc_credits_for_insert(struct inode *inode,
+						struct ext3_ext_path *path)
+{
+	int depth, needed;
+
+	if (path) {
+		/* probably there is space in leaf? */
+		depth = ext_depth(inode);
+		if (le16_to_cpu(path[depth].p_hdr->eh_entries)
+				< le16_to_cpu(path[depth].p_hdr->eh_max))
+			return 1;
+	}
+
+	/*
+	 * given 32bit logical block (4294967296 blocks), max. tree
+	 * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
+	 * let's also add one more level for imbalance.
+	 */
+	depth = 5;
+
+	/* allocation of new data block(s) */
+	needed = 2;
+
+	/*
+	 * tree can be full, so it'd need to grow in depth:
+	 * allocation + old root + new root
+	 */
+	needed += 2 + 1 + 1;
+
+	/*
+	 * Index split can happen, we'd need:
+	 *    allocate intermediate indexes (bitmap + group)
+	 *  + change two blocks at each level, but root (already included)
+	 */
+	needed += (depth * 2) + (depth * 2);
+
+	/* any allocation modifies superblock */
+	needed += 1;
+
+	return needed;
+}
+
+static int ext3_remove_blocks(handle_t *handle, struct inode *inode,
+				struct ext3_extent *ex,
+				unsigned long from, unsigned long to)
+{
+	struct buffer_head *bh;
+	int i;
+
+#ifdef EXTENTS_STATS
+	{
+		struct ext3_sb_info *sbi = EXT3_SB(inode->i_sb);
+		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
+		spin_lock(&sbi->s_ext_stats_lock);
+		sbi->s_ext_blocks += ee_len;
+		sbi->s_ext_extents++;
+		if (ee_len < sbi->s_ext_min)
+			sbi->s_ext_min = ee_len;
+		if (ee_len > sbi->s_ext_max)
+			sbi->s_ext_max = ee_len;
+		if (ext_depth(inode) > sbi->s_depth_max)
+			sbi->s_depth_max = ext_depth(inode);
+		spin_unlock(&sbi->s_ext_stats_lock);
+	}
+#endif
+	if (from >= le32_to_cpu(ex->ee_block)
+	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		/* tail removal */
+		unsigned long num, start;
+		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
+		start = le32_to_cpu(ex->ee_start) + le16_to_cpu(ex->ee_len) - num;
+		ext_debug("free last %lu blocks starting %lu\n", num, start);
+		for (i = 0; i < num; i++) {
+			bh = sb_find_get_block(inode->i_sb, start + i);
+			ext3_forget(handle, 0, inode, bh, start + i);
+		}
+		ext3_free_blocks(handle, inode, start, num);
+	} else if (from == le32_to_cpu(ex->ee_block)
+		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		printk("strange request: removal %lu-%lu from %u:%u\n",
+		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+	} else {
+		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
+		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+	}
+	return 0;
+}
+
+static int
+ext3_ext_rm_leaf(handle_t *handle, struct inode *inode,
+		struct ext3_ext_path *path, unsigned long start)
+{
+	int err = 0, correct_index = 0;
+	int depth = ext_depth(inode), credits;
+	struct ext3_extent_header *eh;
+	unsigned a, b, block, num;
+	unsigned long ex_ee_block;
+	unsigned short ex_ee_len;
+	struct ext3_extent *ex;
+
+	ext_debug("truncate since %lu in leaf\n", start);
+	if (!path[depth].p_hdr)
+		path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
+	eh = path[depth].p_hdr;
+	BUG_ON(eh == NULL);
+	BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+	BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+
+	/* find where to start removing */
+	ex = EXT_LAST_EXTENT(eh);
+
+	ex_ee_block = le32_to_cpu(ex->ee_block);
+	ex_ee_len = le16_to_cpu(ex->ee_len);
+
+	while (ex >= EXT_FIRST_EXTENT(eh) &&
+			ex_ee_block + ex_ee_len > start) {
+		ext_debug("remove ext %lu:%u\n", ex_ee_block, ex_ee_len);
+		path[depth].p_ext = ex;
+
+		a = ex_ee_block > start ? ex_ee_block : start;
+		b = ex_ee_block + ex_ee_len - 1 < EXT_MAX_BLOCK ?
+			ex_ee_block + ex_ee_len - 1 : EXT_MAX_BLOCK;
+
+		ext_debug("  border %u:%u\n", a, b);
+
+		if (a != ex_ee_block && b != ex_ee_block + ex_ee_len - 1) {
+			block = 0;
+			num = 0;
+			BUG();
+		} else if (a != ex_ee_block) {
+			/* remove tail of the extent */
+			block = ex_ee_block;
+			num = a - block;
+		} else if (b != ex_ee_block + ex_ee_len - 1) {
+			/* remove head of the extent */
+			block = a;
+			num = b - a;
+			/* there is no "make a hole" API yet */
+			BUG();
+		} else {
+			/* remove whole extent: excellent! */
+			block = ex_ee_block;
+			num = 0;
+			BUG_ON(a != ex_ee_block);
+			BUG_ON(b != ex_ee_block + ex_ee_len - 1);
+		}
+
+		/* at present, extent can't cross block group */
+		/* leaf + bitmap + group desc + sb + inode */
+		credits = 5;
+		if (ex == EXT_FIRST_EXTENT(eh)) {
+			correct_index = 1;
+			credits += (ext_depth(inode)) + 1;
+		}
+#ifdef CONFIG_QUOTA
+		credits += 2 * EXT3_QUOTA_TRANS_BLOCKS(inode->i_sb);
+#endif
+
+		handle = ext3_ext_journal_restart(handle, credits);
+		if (IS_ERR(handle)) {
+			err = PTR_ERR(handle);
+			goto out;
+		}
+
+		err = ext3_ext_get_access(handle, inode, path + depth);
+		if (err)
+			goto out;
+
+		err = ext3_remove_blocks(handle, inode, ex, a, b);
+		if (err)
+			goto out;
+
+		if (num == 0) {
+			/* this extent is removed entirely; mark slot unused */
+			ex->ee_start = 0;
+			eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+		}
+
+		ex->ee_block = cpu_to_le32(block);
+		ex->ee_len = cpu_to_le16(num);
+
+		err = ext3_ext_dirty(handle, inode, path + depth);
+		if (err)
+			goto out;
+
+		ext_debug("new extent: %u:%u:%u\n", block, num,
+				le32_to_cpu(ex->ee_start));
+		ex--;
+		ex_ee_block = le32_to_cpu(ex->ee_block);
+		ex_ee_len = le16_to_cpu(ex->ee_len);
+	}
+
+	if (correct_index && eh->eh_entries)
+		err = ext3_ext_correct_indexes(handle, inode, path);
+
+	/* if this leaf is free, then we should
+	 * remove it from index block above */
+	if (err == 0 && eh->eh_entries == 0 && path[depth].p_bh != NULL)
+		err = ext3_ext_rm_idx(handle, inode, path + depth);
+
+out:
+	return err;
+}
+
+/*
+ * returns 1 if current index has to be freed (even partially)
+ */
+static inline int
+ext3_ext_more_to_rm(struct ext3_ext_path *path)
+{
+	BUG_ON(path->p_idx == NULL);
+
+	if (path->p_idx < EXT_FIRST_INDEX(path->p_hdr))
+		return 0;
+
+	/*
+	 * if truncate on a deeper level happened, it wasn't partial,
+	 * so we have to consider current index for truncation
+	 */
+	if (le16_to_cpu(path->p_hdr->eh_entries) == path->p_block)
+		return 0;
+	return 1;
+}
+
+int ext3_ext_remove_space(struct inode *inode, unsigned long start)
+{
+	struct super_block *sb = inode->i_sb;
+	int depth = ext_depth(inode);
+	struct ext3_ext_path *path;
+	handle_t *handle;
+	int i = 0, err = 0;
+
+	ext_debug("truncate since %lu\n", start);
+
+	/* probably first extent we're gonna free will be last in block */
+	handle = ext3_journal_start(inode, depth + 1);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ext3_ext_invalidate_cache(inode);
+
+	/*
+	 * we start scanning from the right side, freeing all the blocks
+	 * after i_size and walking down into the tree
+	 */
+	path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 1), GFP_KERNEL);
+	if (path == NULL) {
+		ext3_journal_stop(handle);
+		return -ENOMEM;
+	}
+	memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1));
+	path[0].p_hdr = ext_inode_hdr(inode);
+	if (ext3_ext_check_header(__FUNCTION__, inode, path[0].p_hdr)) {
+		err = -EIO;
+		goto out;
+	}
+	path[0].p_depth = depth;
+
+	while (i >= 0 && err == 0) {
+		if (i == depth) {
+			/* this is leaf block */
+			err = ext3_ext_rm_leaf(handle, inode, path, start);
+			/* root level has p_bh == NULL, brelse() eats this */
+			brelse(path[i].p_bh);
+			path[i].p_bh = NULL;
+			i--;
+			continue;
+		}
+
+		/* this is index block */
+		if (!path[i].p_hdr) {
+			ext_debug("initialize header\n");
+			path[i].p_hdr = ext_block_hdr(path[i].p_bh);
+			if (ext3_ext_check_header(__FUNCTION__, inode,
+							path[i].p_hdr)) {
+				err = -EIO;
+				goto out;
+			}
+		}
+
+		BUG_ON(le16_to_cpu(path[i].p_hdr->eh_entries)
+			   > le16_to_cpu(path[i].p_hdr->eh_max));
+		BUG_ON(path[i].p_hdr->eh_magic != EXT3_EXT_MAGIC);
+
+		if (!path[i].p_idx) {
+			/* this level hasn't been touched yet */
+			path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
+			path[i].p_block = le16_to_cpu(path[i].p_hdr->eh_entries)+1;
+			ext_debug("init index ptr: hdr 0x%p, num %d\n",
+				  path[i].p_hdr,
+				  le16_to_cpu(path[i].p_hdr->eh_entries));
+		} else {
+			/* we were already here, look at the next index */
+			path[i].p_idx--;
+		}
+
+		ext_debug("level %d - index, first 0x%p, cur 0x%p\n",
+				i, EXT_FIRST_INDEX(path[i].p_hdr),
+				path[i].p_idx);
+		if (ext3_ext_more_to_rm(path + i)) {
+			/* go to the next level */
+			ext_debug("move to level %d (block %d)\n",
+				  i + 1, le32_to_cpu(path[i].p_idx->ei_leaf));
+			memset(path + i + 1, 0, sizeof(*path));
+			path[i+1].p_bh =
+				sb_bread(sb, le32_to_cpu(path[i].p_idx->ei_leaf));
+			if (!path[i+1].p_bh) {
+				/* should we reset i_size? */
+				err = -EIO;
+				break;
+			}
+
+			/* save the actual number of indexes to know whether
+			 * this number changed by the next iteration */
+			path[i].p_block = le16_to_cpu(path[i].p_hdr->eh_entries);
+			i++;
+		} else {
+			/* we finished processing this index, go up */
+			if (path[i].p_hdr->eh_entries == 0 && i > 0) {
+				/* index is empty, remove it
+				 * handle must be already prepared by
+				 * ext3_ext_rm_leaf() */
+				err = ext3_ext_rm_idx(handle, inode, path + i);
+			}
+			/* root level has p_bh == NULL, brelse() eats this */
+			brelse(path[i].p_bh);
+			path[i].p_bh = NULL;
+			i--;
+			ext_debug("return to level %d\n", i);
+		}
+	}
+
+	/* TODO: flexible tree reduction should be here */
+	if (path->p_hdr->eh_entries == 0) {
+		/*
+		 * truncate to zero freed all the tree
+		 * so, we need to correct eh_depth
+		 */
+		err = ext3_ext_get_access(handle, inode, path);
+		if (err == 0) {
+			ext_inode_hdr(inode)->eh_depth = 0;
+			ext_inode_hdr(inode)->eh_max =
+				cpu_to_le16(ext3_ext_space_root(inode));
+			err = ext3_ext_dirty(handle, inode, path);
+		}
+	}
+out:
+	ext3_ext_tree_changed(inode);
+	ext3_ext_drop_refs(path);
+	kfree(path);
+	ext3_journal_stop(handle);
+
+	return err;
+}
+
+/*
+ * called at mount time
+ */
+void ext3_ext_init(struct super_block *sb)
+{
+	/*
+	 * possible initialization would be here
+	 */
+
+	if (test_opt(sb, EXTENTS)) {
+		printk("EXT3-fs: file extents enabled");
+#ifdef AGRESSIVE_TEST
+		printk(", aggressive tests");
+#endif
+#ifdef CHECK_BINSEARCH
+		printk(", check binsearch");
+#endif
+#ifdef EXTENTS_STATS
+		printk(", stats");
+#endif
+		printk("\n");
+#ifdef EXTENTS_STATS
+		spin_lock_init(&EXT3_SB(sb)->s_ext_stats_lock);
+		EXT3_SB(sb)->s_ext_min = 1 << 30;
+		EXT3_SB(sb)->s_ext_max = 0;
+#endif
+	}
+}
+
+/*
+ * called at umount time
+ */
+void ext3_ext_release(struct super_block *sb)
+{
+	if (!test_opt(sb, EXTENTS))
+		return;
+
+#ifdef EXTENTS_STATS
+	if (EXT3_SB(sb)->s_ext_blocks && EXT3_SB(sb)->s_ext_extents) {
+		struct ext3_sb_info *sbi = EXT3_SB(sb);
+		printk(KERN_ERR "EXT3-fs: %lu blocks in %lu extents (%lu ave)\n",
+			sbi->s_ext_blocks, sbi->s_ext_extents,
+			sbi->s_ext_blocks / sbi->s_ext_extents);
+		printk(KERN_ERR "EXT3-fs: extents: %lu min, %lu max, max depth %lu\n",
+			sbi->s_ext_min, sbi->s_ext_max, sbi->s_depth_max);
+	}
+#endif
+}
+
+int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, sector_t iblock,
+			unsigned long max_blocks, struct buffer_head *bh_result,
+			int create, int extend_disksize)
+{
+	struct ext3_ext_path *path = NULL;
+	struct ext3_extent newex, *ex;
+	int goal, newblock, err = 0, depth;
+	unsigned long allocated = 0;
+
+	__clear_bit(BH_New, &bh_result->b_state);
+	ext_debug("blocks %d/%lu requested for inode %u\n", (int) iblock,
+			max_blocks, (unsigned) inode->i_ino);
+	mutex_lock(&EXT3_I(inode)->truncate_mutex);
+
+	/* check in cache */
+	if ((goal = ext3_ext_in_cache(inode, iblock, &newex))) {
+		if (goal == EXT3_EXT_CACHE_GAP) {
+			if (!create) {
+				/* block isn't allocated yet and
+				 * user doesn't want to allocate it */
+				goto out2;
+			}
+			/* we should allocate requested block */
+		} else if (goal == EXT3_EXT_CACHE_EXTENT) {
+			/* block is already allocated */
+			newblock = iblock
+				   - le32_to_cpu(newex.ee_block)
+				   + le32_to_cpu(newex.ee_start);
+			/* number of remaining blocks in the extent */
+			allocated = le16_to_cpu(newex.ee_len) -
+					(iblock - le32_to_cpu(newex.ee_block));
+			goto out;
+		} else {
+			BUG();
+		}
+	}
+
+	/* find extent for this block */
+	path = ext3_ext_find_extent(inode, iblock, NULL);
+	if (IS_ERR(path)) {
+		err = PTR_ERR(path);
+		path = NULL;
+		goto out2;
+	}
+
+	depth = ext_depth(inode);
+
+	/*
+	 * a consistent leaf must not be empty;
+	 * this situation is possible, though, _during_ tree modification;
+	 * this is why an assert can't be put in ext3_ext_find_extent()
+	 */
+	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+
+	if ((ex = path[depth].p_ext)) {
+		unsigned long ee_block = le32_to_cpu(ex->ee_block);
+		unsigned long ee_start = le32_to_cpu(ex->ee_start);
+		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		/* if found extent covers block, simply return it */
+		if (iblock >= ee_block && iblock < ee_block + ee_len) {
+			newblock = iblock - ee_block + ee_start;
+			/* number of remaining blocks in the extent */
+			allocated = ee_len - (iblock - ee_block);
+			ext_debug("%d fit into %lu:%d -> %d\n", (int) iblock,
+					ee_block, ee_len, newblock);
+			ext3_ext_put_in_cache(inode, ee_block, ee_len,
+						ee_start, EXT3_EXT_CACHE_EXTENT);
+			goto out;
+		}
+	}
+
+	/*
+	 * requested block isn't allocated yet
+	 * we must not try to create a block if the create flag is zero
+	 */
+	if (!create) {
+		/* put just found gap into cache to speed up subsequent reqs */
+		ext3_ext_put_gap_in_cache(inode, path, iblock);
+		goto out2;
+	}
+
+	/* allocate new block */
+	goal = ext3_ext_find_goal(inode, path, iblock);
+	allocated = max_blocks;
+	newblock = ext3_new_blocks(handle, inode, goal, &allocated, &err);
+	if (!newblock)
+		goto out2;
+	ext_debug("allocate new block: goal %d, found %d/%lu\n",
+			goal, newblock, allocated);
+
+	/* try to insert new extent into found leaf and return */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_start = cpu_to_le32(newblock);
+	newex.ee_len = cpu_to_le16(allocated);
+	err = ext3_ext_insert_extent(handle, inode, path, &newex);
+	if (err)
+		goto out2;
+
+	if (extend_disksize && inode->i_size > EXT3_I(inode)->i_disksize)
+		EXT3_I(inode)->i_disksize = inode->i_size;
+
+	/* previous routine could use block we allocated */
+	newblock = le32_to_cpu(newex.ee_start);
+	__set_bit(BH_New, &bh_result->b_state);
+
+	ext3_ext_put_in_cache(inode, iblock, allocated, newblock,
+				EXT3_EXT_CACHE_EXTENT);
+out:
+	if (allocated > max_blocks)
+		allocated = max_blocks;
+	ext3_ext_show_leaf(inode, path);
+	__set_bit(BH_Mapped, &bh_result->b_state);
+	bh_result->b_bdev = inode->i_sb->s_bdev;
+	bh_result->b_blocknr = newblock;
+out2:
+	if (path) {
+		ext3_ext_drop_refs(path);
+		kfree(path);
+	}
+	mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+
+	return err ? err : allocated;
+}
+
+void ext3_ext_truncate(struct inode * inode, struct page *page)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct super_block *sb = inode->i_sb;
+	unsigned long last_block;
+	handle_t *handle;
+	int err = 0;
+
+	/*
+	 * probably first extent we're gonna free will be last in block
+	 */
+	err = ext3_writepage_trans_blocks(inode) + 3;
+	handle = ext3_journal_start(inode, err);
+	if (IS_ERR(handle)) {
+		if (page) {
+			clear_highpage(page);
+			flush_dcache_page(page);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		return;
+	}
+
+	if (page)
+		ext3_block_truncate_page(handle, page, mapping, inode->i_size);
+
+	mutex_lock(&EXT3_I(inode)->truncate_mutex);
+	ext3_ext_invalidate_cache(inode);
+
+	/*
+	 * TODO: optimization is possible here
+	 * probably we need no scanning at all,
+	 * because page truncation is enough
+	 */
+	if (ext3_orphan_add(handle, inode))
+		goto out_stop;
+
+	/* we have to know where to truncate from in crash case */
+	EXT3_I(inode)->i_disksize = inode->i_size;
+	ext3_mark_inode_dirty(handle, inode);
+
+	last_block = (inode->i_size + sb->s_blocksize - 1)
+			>> EXT3_BLOCK_SIZE_BITS(sb);
+	err = ext3_ext_remove_space(inode, last_block);
+
+	/* In a multi-transaction truncate, we only make the final
+	 * transaction synchronous */
+	if (IS_SYNC(inode))
+		handle->h_sync = 1;
+
+out_stop:
+	/*
+	 * If this was a simple ftruncate(), and the file will remain alive
+	 * then we need to clear up the orphan record which we created above.
+	 * However, if this was a real unlink then we were called by
+	 * ext3_delete_inode(), and we allow that function to clean up the
+	 * orphan info for us.
+	 */
+	if (inode->i_nlink)
+		ext3_orphan_del(handle, inode);
+
+	mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+	ext3_journal_stop(handle);
+}
+
+/*
+ * this routine calculates the max number of blocks we could modify
+ * in order to allocate new block for an inode
+ */
+int ext3_ext_writepage_trans_blocks(struct inode *inode, int num)
+{
+	int needed;
+
+	needed = ext3_ext_calc_credits_for_insert(inode, NULL);
+
+	/* caller wants to allocate num blocks, but note it includes sb */
+	needed = needed * num - (num - 1);
+
+#ifdef CONFIG_QUOTA
+	needed += 2 * EXT3_QUOTA_TRANS_BLOCKS(inode->i_sb);
+#endif
+
+	return needed;
+}
+
+EXPORT_SYMBOL(ext3_mark_inode_dirty);
+EXPORT_SYMBOL(ext3_ext_invalidate_cache);
+EXPORT_SYMBOL(ext3_ext_insert_extent);
+EXPORT_SYMBOL(ext3_ext_walk_space);
+EXPORT_SYMBOL(ext3_ext_find_goal);
+EXPORT_SYMBOL(ext3_ext_calc_credits_for_insert);
+
diff -puN fs/ext3/ialloc.c~ext3-extents fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~ext3-extents	2006-06-28 13:25:19.631990895 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c	2006-06-28 13:39:24.990992777 -0700
@@ -616,6 +616,17 @@ got:
 		ext3_std_error(sb, err);
 		goto fail_free_drop;
 	}
+	if (test_opt(sb, EXTENTS)) {
+		EXT3_I(inode)->i_flags |= EXT3_EXTENTS_FL;
+		ext3_ext_tree_init(handle, inode);
+		if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS)) {
+			err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh);
+			if (err) goto fail;
+			EXT3_SET_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS);
+			BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "call ext3_journal_dirty_metadata");
+			err = ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
+		}
+	}
 
 	ext3_debug("allocating inode %lu\n", inode->i_ino);
 	goto really_out;
diff -puN fs/ext3/inode.c~ext3-extents fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~ext3-extents	2006-06-28 13:25:19.648988944 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c	2006-06-28 13:39:05.103274732 -0700
@@ -39,8 +39,6 @@
 #include "xattr.h"
 #include "acl.h"
 
-static int ext3_writepage_trans_blocks(struct inode *inode);
-
 /*
  * Test whether an inode is a fast symlink.
  */
@@ -803,6 +801,7 @@ int ext3_get_blocks_handle(handle_t *han
 	ext3_fsblk_t first_block = 0;
 

+	J_ASSERT(!(EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL));
 	J_ASSERT(handle != NULL || create == 0);
 	depth = ext3_block_to_path(inode,iblock,offsets,&blocks_to_boundary);
 
@@ -983,7 +982,7 @@ static int ext3_get_block(struct inode *
 
 get_block:
 	if (ret == 0) {
-		ret = ext3_get_blocks_handle(handle, inode, iblock,
+		ret = ext3_get_blocks_wrap(handle, inode, iblock,
 					max_blocks, bh_result, create, 0);
 		if (ret > 0) {
 			bh_result->b_size = (ret << inode->i_blkbits);
@@ -1007,7 +1006,7 @@ struct buffer_head *ext3_getblk(handle_t
 	dummy.b_state = 0;
 	dummy.b_blocknr = -1000;
 	buffer_trace_init(&dummy.b_history);
-	err = ext3_get_blocks_handle(handle, inode, block, 1,
+	err = ext3_get_blocks_wrap(handle, inode, block, 1,
 					&dummy, create, 1);
 	if (err == 1) {
 		err = 0;
@@ -1755,7 +1754,7 @@ void ext3_set_aops(struct inode *inode)
  * This required during truncate. We need to physically zero the tail end
  * of that block so it doesn't yield old data if the file is later grown.
  */
-static int ext3_block_truncate_page(handle_t *handle, struct page *page,
+int ext3_block_truncate_page(handle_t *handle, struct page *page,
 		struct address_space *mapping, loff_t from)
 {
 	ext3_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -2259,6 +2258,9 @@ void ext3_truncate(struct inode *inode)
 			return;
 	}
 
+	if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+		return ext3_ext_truncate(inode, page);
+
 	handle = start_transaction(inode);
 	if (IS_ERR(handle)) {
 		if (page) {
@@ -3001,12 +3003,15 @@ err_out:
  * block and work out the exact number of indirects which are touched.  Pah.
  */
 
-static int ext3_writepage_trans_blocks(struct inode *inode)
+int ext3_writepage_trans_blocks(struct inode *inode)
 {
 	int bpp = ext3_journal_blocks_per_page(inode);
 	int indirects = (EXT3_NDIR_BLOCKS % bpp) ? 5 : 3;
 	int ret;
 
+	if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+		return ext3_ext_writepage_trans_blocks(inode, bpp);
+
 	if (ext3_should_journal_data(inode))
 		ret = 3 * (bpp + indirects) + 2;
 	else
diff -puN fs/ext3/ioctl.c~ext3-extents fs/ext3/ioctl.c
--- linux-2.6.17/fs/ext3/ioctl.c~ext3-extents	2006-06-28 13:25:19.649988830 -0700
+++ linux-2.6.17-ming/fs/ext3/ioctl.c	2006-06-28 13:25:19.680985273 -0700
@@ -247,7 +247,6 @@ flags_err:
 		return err;
 	}
 
-
 	default:
 		return -ENOTTY;
 	}
diff -puN fs/ext3/Makefile~ext3-extents fs/ext3/Makefile
--- linux-2.6.17/fs/ext3/Makefile~ext3-extents	2006-06-28 13:25:19.651988600 -0700
+++ linux-2.6.17-ming/fs/ext3/Makefile	2006-06-28 13:25:19.681985158 -0700
@@ -5,7 +5,7 @@
 obj-$(CONFIG_EXT3_FS) += ext3.o
 
 ext3-y	:= balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
-	   ioctl.o namei.o super.o symlink.o hash.o resize.o
+	   ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o
 
 ext3-$(CONFIG_EXT3_FS_XATTR)	 += xattr.o xattr_user.o xattr_trusted.o
 ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff -puN fs/ext3/super.c~ext3-extents fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~ext3-extents	2006-06-28 13:25:19.652988486 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c	2006-06-28 13:39:24.996992088 -0700
@@ -390,6 +390,7 @@ static void ext3_put_super (struct super
 	struct ext3_super_block *es = sbi->s_es;
 	int i;
 
+	ext3_ext_release(sb);
 	ext3_xattr_put_super(sb);
 	journal_destroy(sbi->s_journal);
 	if (!(sb->s_flags & MS_RDONLY)) {
@@ -454,6 +455,7 @@ static struct inode *ext3_alloc_inode(st
 #endif
 	ei->i_block_alloc_info = NULL;
 	ei->vfs_inode.i_version = 1;
+	memset(&ei->i_cached_extent, 0, sizeof(struct ext3_ext_cache));
 	return &ei->vfs_inode;
 }
 
@@ -636,7 +638,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-	Opt_grpquota
+	Opt_grpquota, Opt_extents,
 };
 
 static match_table_t tokens = {
@@ -686,6 +688,7 @@ static match_table_t tokens = {
 	{Opt_quota, "quota"},
 	{Opt_usrquota, "usrquota"},
 	{Opt_barrier, "barrier=%u"},
+	{Opt_extents, "extents"},
 	{Opt_err, NULL},
 	{Opt_resize, "resize"},
 };
@@ -1018,6 +1021,9 @@ clear_qf_name:
 		case Opt_bh:
 			clear_opt(sbi->s_mount_opt, NOBH);
 			break;
+		case Opt_extents:
+			set_opt (sbi->s_mount_opt, EXTENTS);
+			break;
 		default:
 			printk (KERN_ERR
 				"EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1743,6 +1749,8 @@ static int ext3_fill_super (struct super
 		test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
 		"writeback");
 
+	ext3_ext_init(sb);
+
 	lock_kernel();
 	return 0;
 
diff -puN /dev/null include/linux/ext3_fs_extents.h
--- /dev/null	2006-06-28 00:02:13.345547960 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h	2006-06-28 13:39:22.745250457 -0700
@@ -0,0 +1,196 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
+ * Written by Alex Tomas <alex@clusterfs.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
+ */
+
+#ifndef _LINUX_EXT3_EXTENTS
+#define _LINUX_EXT3_EXTENTS
+
+#include <linux/ext3_fs.h>
+
+/*
+ * with AGRESSIVE_TEST defined, the capacity of index/leaf blocks
+ * becomes very small, so index splits, in-depth growing and
+ * other hard changes happen much more often
+ * this is for debug purposes only
+ */
+#define AGRESSIVE_TEST_
+
+/*
+ * with EXTENTS_STATS defined, the numbers of blocks and extents
+ * are collected in the truncate path. they'll be shown at
+ * umount time
+ */
+#define EXTENTS_STATS__
+
+/*
+ * if CHECK_BINSEARCH defined, then results of binary search
+ * will be checked by linear search
+ */
+#define CHECK_BINSEARCH__
+
+/*
+ * if EXT_DEBUG is defined you can use 'extdebug' mount option
+ * to get lots of info about what's going on
+ */
+#define EXT_DEBUG__
+#ifdef EXT_DEBUG
+#define ext_debug(a...)		printk(a)
+#else
+#define ext_debug(a...)
+#endif
+
+/*
+ * if EXT_STATS is defined then stats numbers are collected
+ * these numbers will be displayed at umount time
+ */
+#define EXT_STATS_
+
+
+/*
+ * ext3_inode has i_block array (60 bytes total)
+ * first 12 bytes store ext3_extent_header
+ * the remainder stores an array of ext3_extent
+ */
+
+/*
+ * this is extent on-disk structure
+ * it's used at the bottom of the tree
+ */
+struct ext3_extent {
+	__le32	ee_block;	/* first logical block extent covers */
+	__le16	ee_len;		/* number of blocks covered by extent */
+	__le16	ee_start_hi;	/* high 16 bits of physical block */
+	__le32	ee_start;	/* low 32 bits of physical block */
+};
+
+/*
+ * this is index on-disk structure
+ * it's used at all the levels, but the bottom
+ */
+struct ext3_extent_idx {
+	__le32	ei_block;	/* index covers logical blocks from 'block' */
+	__le32	ei_leaf;	/* pointer to the physical block of the next *
+				 * level. leaf or next index could be here */
+	__le16	ei_leaf_hi;	/* high 16 bits of physical block */
+	__u16	ei_unused;
+};
+
+/*
+ * each block (leaves and indexes), even inode-stored, has a header
+ */
+struct ext3_extent_header {
+	__le16	eh_magic;	/* probably will support different formats */
+	__le16	eh_entries;	/* number of valid entries */
+	__le16	eh_max;		/* capacity of store in entries */
+	__le16	eh_depth;	/* has tree real underlying blocks? */
+	__le32	eh_generation;	/* generation of the tree */
+};
+
+#define EXT3_EXT_MAGIC		cpu_to_le16(0xf30a)
+
+/*
+ * array of ext3_ext_path contains path to some extent
+ * creation/lookup routines use it for traversal/splitting/etc
+ * truncate uses it to simulate recursive walking
+ */
+struct ext3_ext_path {
+	__u32				p_block;
+	__u16				p_depth;
+	struct ext3_extent		*p_ext;
+	struct ext3_extent_idx		*p_idx;
+	struct ext3_extent_header	*p_hdr;
+	struct buffer_head		*p_bh;
+};
+
+/*
+ * structure for external API
+ */
+
+#define EXT3_EXT_CACHE_NO	0
+#define EXT3_EXT_CACHE_GAP	1
+#define EXT3_EXT_CACHE_EXTENT	2
+
+/*
+ * to be called by ext3_ext_walk_space()
+ * negative retcode - error
+ * positive retcode - signal for ext3_ext_walk_space(), see below
+ * callback must return valid extent (passed or newly created)
+ */
+typedef int (*ext_prepare_callback)(struct inode *, struct ext3_ext_path *,
+					struct ext3_ext_cache *,
+					void *);
+
+#define EXT_CONTINUE	0
+#define EXT_BREAK	1
+#define EXT_REPEAT	2
+
+
+#define EXT_MAX_BLOCK	0xffffffff
+
+
+#define EXT_FIRST_EXTENT(__hdr__) \
+	((struct ext3_extent *) (((char *) (__hdr__)) +		\
+				 sizeof(struct ext3_extent_header)))
+#define EXT_FIRST_INDEX(__hdr__) \
+	((struct ext3_extent_idx *) (((char *) (__hdr__)) +	\
+				     sizeof(struct ext3_extent_header)))
+#define EXT_HAS_FREE_INDEX(__path__) \
+        (le16_to_cpu((__path__)->p_hdr->eh_entries) \
+	                             < le16_to_cpu((__path__)->p_hdr->eh_max))
+#define EXT_LAST_EXTENT(__hdr__) \
+	(EXT_FIRST_EXTENT((__hdr__)) + le16_to_cpu((__hdr__)->eh_entries) - 1)
+#define EXT_LAST_INDEX(__hdr__) \
+	(EXT_FIRST_INDEX((__hdr__)) + le16_to_cpu((__hdr__)->eh_entries) - 1)
+#define EXT_MAX_EXTENT(__hdr__) \
+	(EXT_FIRST_EXTENT((__hdr__)) + le16_to_cpu((__hdr__)->eh_max) - 1)
+#define EXT_MAX_INDEX(__hdr__) \
+	(EXT_FIRST_INDEX((__hdr__)) + le16_to_cpu((__hdr__)->eh_max) - 1)
+
+static inline struct ext3_extent_header *ext_inode_hdr(struct inode *inode)
+{
+	return (struct ext3_extent_header *) EXT3_I(inode)->i_data;
+}
+
+static inline struct ext3_extent_header *ext_block_hdr(struct buffer_head *bh)
+{
+	return (struct ext3_extent_header *) bh->b_data;
+}
+
+static inline unsigned short ext_depth(struct inode *inode)
+{
+	return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
+}
+
+static inline void ext3_ext_tree_changed(struct inode *inode)
+{
+	EXT3_I(inode)->i_ext_generation++;
+}
+
+static inline void
+ext3_ext_invalidate_cache(struct inode *inode)
+{
+	EXT3_I(inode)->i_cached_extent.ec_type = EXT3_EXT_CACHE_NO;
+}
+
+extern int ext3_ext_tree_init(handle_t *, struct inode *);
+extern int ext3_ext_calc_credits_for_insert(struct inode *, struct ext3_ext_path *);
+extern int ext3_ext_insert_extent(handle_t *, struct inode *, struct ext3_ext_path *, struct ext3_extent *);
+extern int ext3_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
+extern struct ext3_ext_path * ext3_ext_find_extent(struct inode *, int, struct ext3_ext_path *);
+
+#endif /* _LINUX_EXT3_EXTENTS */
+
diff -puN include/linux/ext3_fs.h~ext3-extents include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3-extents	2006-06-28 13:25:19.654988256 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 13:39:24.998991859 -0700
@@ -182,8 +182,9 @@ struct ext3_group_desc
 #define EXT3_DIRSYNC_FL			0x00010000 /* dirsync behaviour (directories only) */
 #define EXT3_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define EXT3_RESERVED_FL		0x80000000 /* reserved for ext3 lib */
+#define EXT3_EXTENTS_FL			0x00080000 /* Inode uses extents */
 
-#define EXT3_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
+#define EXT3_FL_USER_VISIBLE		0x000BDFFF /* User visible flags */
 #define EXT3_FL_USER_MODIFIABLE		0x000380FF /* User modifiable flags */
 
 /*
@@ -371,6 +372,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define EXT3_MOUNT_USRQUOTA		0x100000 /* "old" user quota */
 #define EXT3_MOUNT_GRPQUOTA		0x200000 /* "old" group quota */
+#define EXT3_MOUNT_EXTENTS		0x400000 /* Extents support */
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
@@ -560,11 +562,13 @@ static inline struct ext3_inode_info *EX
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004 /* Needs recovery */
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
 #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT3_FEATURE_INCOMPAT_EXTENTS		0x0040 /* extents support */
 
 #define EXT3_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT3_FEATURE_INCOMPAT_SUPP	(EXT3_FEATURE_INCOMPAT_FILETYPE| \
 					 EXT3_FEATURE_INCOMPAT_RECOVER| \
-					 EXT3_FEATURE_INCOMPAT_META_BG)
+					 EXT3_FEATURE_INCOMPAT_META_BG| \
+					 EXT3_FEATURE_INCOMPAT_EXTENTS)
 #define EXT3_FEATURE_RO_COMPAT_SUPP	(EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
 					 EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
@@ -803,6 +807,9 @@ extern int ext3_get_inode_loc(struct ino
 extern void ext3_truncate (struct inode *);
 extern void ext3_set_inode_flags(struct inode *);
 extern void ext3_set_aops(struct inode *inode);
+extern int ext3_writepage_trans_blocks(struct inode *);
+extern int ext3_block_truncate_page(handle_t *handle, struct page *page,
+		struct address_space *mapping, loff_t from);
 
 /* ioctl.c */
 extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
@@ -856,6 +863,26 @@ extern struct inode_operations ext3_spec
 extern struct inode_operations ext3_symlink_inode_operations;
 extern struct inode_operations ext3_fast_symlink_inode_operations;
 
+/* extents.c */
+extern int ext3_ext_tree_init(handle_t *handle, struct inode *);
+extern int ext3_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext3_ext_get_blocks(handle_t *, struct inode *, sector_t,
+				unsigned long, struct buffer_head *, int, int);
+extern void ext3_ext_truncate(struct inode *, struct page *);
+extern void ext3_ext_init(struct super_block *);
+extern void ext3_ext_release(struct super_block *);
+static inline int
+ext3_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
+			unsigned long max_blocks, struct buffer_head *bh,
+			int create, int extend_disksize)
+{
+	if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+		return ext3_ext_get_blocks(handle, inode, block, max_blocks,
+					bh, create, extend_disksize);
+	return ext3_get_blocks_handle(handle, inode, block, max_blocks, bh,
+					create, extend_disksize);
+}
+
 
 #endif	/* __KERNEL__ */
 
diff -puN include/linux/ext3_fs_i.h~ext3-extents include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents	2006-06-28 13:25:19.670986420 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h	2006-06-28 13:39:24.999991744 -0700
@@ -65,6 +65,16 @@ struct ext3_block_alloc_info {
 #define rsv_end rsv_window._rsv_end
 
 /*
+ * storage for cached extent
+ */
+struct ext3_ext_cache {
+	__u32	ec_start;
+	__u32	ec_block;
+	__u32	ec_len; /* must be 32bit to return holes */
+	__u32	ec_type;
+};
+
+/*
  * third extended file system inode data in memory
  */
 struct ext3_inode_info {
@@ -142,6 +152,9 @@ struct ext3_inode_info {
 	 */
 	struct mutex truncate_mutex;
 	struct inode vfs_inode;
+
+	unsigned long i_ext_generation;
+	struct ext3_ext_cache i_cached_extent;
 };
 
 #endif	/* _LINUX_EXT3_FS_I */
diff -puN include/linux/ext3_fs_sb.h~ext3-extents include/linux/ext3_fs_sb.h
--- linux-2.6.17/include/linux/ext3_fs_sb.h~ext3-extents	2006-06-28 13:25:19.672986191 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_sb.h	2006-06-28 13:25:19.686984585 -0700
@@ -78,6 +78,16 @@ struct ext3_sb_info {
 	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
 	int s_jquota_fmt;			/* Format of quota to use */
 #endif
+
+#ifdef EXTENTS_STATS
+	/* ext3 extents stats */
+	unsigned long s_ext_min;
+	unsigned long s_ext_max;
+	unsigned long s_depth_max;
+	spinlock_t s_ext_stats_lock;
+	unsigned long s_ext_blocks;
+	unsigned long s_ext_extents;
+#endif
 };
 
 #endif	/* _LINUX_EXT3_FS_SB */
diff -puN include/linux/ext3_jbd.h~ext3-extents include/linux/ext3_jbd.h
--- linux-2.6.17/include/linux/ext3_jbd.h~ext3-extents	2006-06-28 13:25:19.673986076 -0700
+++ linux-2.6.17-ming/include/linux/ext3_jbd.h	2006-06-28 13:39:09.692748127 -0700
@@ -26,9 +26,14 @@
  * 
  * We may have to touch one inode, one bitmap buffer, up to three
  * indirection blocks, the group and superblock summaries, and the data
- * block to complete the transaction.  */
-
-#define EXT3_SINGLEDATA_TRANS_BLOCKS	8U
+ * block to complete the transaction.
+ *
+ * For extents-enabled fs we may have to allocate and modify up to
+ * 5 levels of tree + root which is stored in inode. */
+
+#define EXT3_SINGLEDATA_TRANS_BLOCKS(sb)				\
+	(EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS)	\
+		|| test_opt(sb, EXTENTS) ? 27U : 8U)
 
 /* Extended attribute operations touch at most two data buffers,
  * two bitmap buffers, and two group summaries, in addition to the inode
@@ -42,7 +47,7 @@
  * superblock only gets updated once, of course, so don't bother
  * counting that again for the quota updates. */
 
-#define EXT3_DATA_TRANS_BLOCKS(sb)	(EXT3_SINGLEDATA_TRANS_BLOCKS + \
+#define EXT3_DATA_TRANS_BLOCKS(sb)	(EXT3_SINGLEDATA_TRANS_BLOCKS(sb) + \
 					 EXT3_XATTR_TRANS_BLOCKS - 2 + \
 					 2*EXT3_QUOTA_TRANS_BLOCKS(sb))
 
@@ -78,9 +83,9 @@
 /* Amount of blocks needed for quota insert/delete - we do some block writes
  * but inode, sb and group updates are done only once */
 #define EXT3_QUOTA_INIT_BLOCKS(sb) (test_opt(sb, QUOTA) ? (DQUOT_INIT_ALLOC*\
-		(EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3+DQUOT_INIT_REWRITE) : 0)
+		(EXT3_SINGLEDATA_TRANS_BLOCKS(sb)-3)+3+DQUOT_INIT_REWRITE) : 0)
 #define EXT3_QUOTA_DEL_BLOCKS(sb) (test_opt(sb, QUOTA) ? (DQUOT_DEL_ALLOC*\
-		(EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3+DQUOT_DEL_REWRITE) : 0)
+		(EXT3_SINGLEDATA_TRANS_BLOCKS(sb)-3)+3+DQUOT_DEL_REWRITE) : 0)
 #else
 #define EXT3_QUOTA_TRANS_BLOCKS(sb) 0
 #define EXT3_QUOTA_INIT_BLOCKS(sb) 0

_
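
For readers following the lookup path in ext3_ext_get_blocks(), the
block arithmetic can be sketched in userspace as below. This is only an
illustration under stated assumptions: struct demo_extent,
demo_ext_pblock() and demo_ext_map() are hypothetical names, the fields
are kept in host byte order (the kernel code converts with
le32_to_cpu()/le16_to_cpu()), and combining ee_start_hi with ee_start is
an assumption about how the 48-bit start would be assembled; the code in
this patch reads only ee_start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian stand-in for struct ext3_extent (hypothetical, for illustration). */
struct demo_extent {
	uint32_t ee_block;	/* first logical block the extent covers */
	uint16_t ee_len;	/* number of blocks covered by the extent */
	uint16_t ee_start_hi;	/* high 16 bits of physical block */
	uint32_t ee_start;	/* low 32 bits of physical block */
};

/* Assumed layout: 16 high bits above the 32 low bits give a 48-bit start. */
static uint64_t demo_ext_pblock(const struct demo_extent *ex)
{
	return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/*
 * Map logical block iblock through one extent.
 * Returns the number of blocks remaining in the extent starting at
 * iblock, or 0 if the extent does not cover iblock (a hole).
 */
static unsigned int demo_ext_map(const struct demo_extent *ex,
				 uint32_t iblock, uint64_t *pblock)
{
	uint32_t off;

	assert(ex != NULL && pblock != NULL);

	if (iblock < ex->ee_block)
		return 0;
	off = iblock - ex->ee_block;
	if (off >= ex->ee_len)
		return 0;

	/* same arithmetic as: newblock = iblock - ee_block + ee_start */
	*pblock = demo_ext_pblock(ex) + off;
	/* same arithmetic as: allocated = ee_len - (iblock - ee_block) */
	return ex->ee_len - off;
}
```

With ee_block = 100, ee_len = 8, ee_start = 5000 and ee_start_hi = 1,
logical block 103 maps to physical block 4294972299 with 5 blocks left
in the extent; logical blocks 99 and 108 fall outside it.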





* [RFC][Update][Patch 2/16]sector_t type format string
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (4 preceding siblings ...)
  2006-06-30  0:16 ` [RFC][Update][Patch 1/16]core extent map support Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel Mingming Cao
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Define SECTOR_FMT so that sector_t values can be printed with the
correct format specifier on each architecture.

Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Acked-by: Andreas Dilger <adilger@clusterfs.com>


---
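
A rough userspace sketch of how such a macro would be used, assuming the
format string is concatenated into the message at compile time.
demo_sector_t, DEMO_SECTOR_FMT and demo_format_sector() are hypothetical
stand-ins, not names from this patch; the ULONG_MAX test only imitates
the per-architecture choice between "%lu" and "%llu".

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins: pick the 64-bit type and its matching format. */
#if ULONG_MAX > 0xffffffffUL
typedef unsigned long demo_sector_t;		/* long is 64 bits here */
#define DEMO_SECTOR_FMT "%lu"
#else
typedef unsigned long long demo_sector_t;	/* long is 32 bits here */
#define DEMO_SECTOR_FMT "%llu"
#endif

/*
 * Format a message containing a sector number. Adjacent string literals
 * are concatenated by the compiler, so the macro drops into the format.
 */
static int demo_format_sector(char *buf, size_t len, demo_sector_t sector)
{
	assert(buf != NULL);
	return snprintf(buf, len,
			"block " DEMO_SECTOR_FMT " out of range", sector);
}
```

The point of the design is that callers never spell out "%lu" or "%llu"
themselves, so the same source line builds without format warnings on
both 32-bit and 64-bit configurations.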

 linux-2.6.17-ming/include/asm-h8300/types.h   |    1 +
 linux-2.6.17-ming/include/asm-i386/types.h    |    1 +
 linux-2.6.17-ming/include/asm-mips/types.h    |    5 +++++
 linux-2.6.17-ming/include/asm-powerpc/types.h |    5 +++++
 linux-2.6.17-ming/include/asm-s390/types.h    |    5 +++++
 linux-2.6.17-ming/include/asm-sh/types.h      |    1 +
 linux-2.6.17-ming/include/asm-x86_64/types.h  |    1 +
 linux-2.6.17-ming/include/linux/types.h       |    1 +
 8 files changed, 20 insertions(+)

diff -puN include/asm-h8300/types.h~sector_fmt include/asm-h8300/types.h
--- linux-2.6.17/include/asm-h8300/types.h~sector_fmt	2006-06-28 16:46:28.523183099 -0700
+++ linux-2.6.17-ming/include/asm-h8300/types.h	2006-06-28 16:46:28.552179772 -0700
@@ -57,6 +57,7 @@ typedef u32 dma_addr_t;
 
 #define HAVE_SECTOR_T
 typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
 
 #define HAVE_BLKCNT_T
 typedef u64 blkcnt_t;
diff -puN include/asm-i386/types.h~sector_fmt include/asm-i386/types.h
--- linux-2.6.17/include/asm-i386/types.h~sector_fmt	2006-06-28 16:46:28.526182755 -0700
+++ linux-2.6.17-ming/include/asm-i386/types.h	2006-06-28 16:46:28.553179658 -0700
@@ -59,6 +59,7 @@ typedef u64 dma64_addr_t;
 
 #ifdef CONFIG_LBD
 typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
 #define HAVE_SECTOR_T
 #endif
 
diff -puN include/asm-mips/types.h~sector_fmt include/asm-mips/types.h
--- linux-2.6.17/include/asm-mips/types.h~sector_fmt	2006-06-28 16:46:28.530182296 -0700
+++ linux-2.6.17-ming/include/asm-mips/types.h	2006-06-28 16:46:28.554179543 -0700
@@ -95,6 +95,11 @@ typedef unsigned long phys_t;
 
 #ifdef CONFIG_LBD
 typedef u64 sector_t;
+#if (_MIPS_SZLONG == 64)
+#define SECTOR_FMT "%lu"
+#else
+#define SECTOR_FMT "%llu"
+#endif
 #define HAVE_SECTOR_T
 #endif
 
diff -puN include/asm-powerpc/types.h~sector_fmt include/asm-powerpc/types.h
--- linux-2.6.17/include/asm-powerpc/types.h~sector_fmt	2006-06-28 16:46:28.534181837 -0700
+++ linux-2.6.17-ming/include/asm-powerpc/types.h	2006-06-28 16:46:28.554179543 -0700
@@ -99,6 +99,11 @@ typedef struct {
 
 #ifdef CONFIG_LBD
 typedef u64 sector_t;
+#ifdef __powerpc64__
+#define SECTOR_FMT "%lu"
+#else
+#define SECTOR_FMT "%llu"
+#endif
 #define HAVE_SECTOR_T
 #endif
 
diff -puN include/asm-s390/types.h~sector_fmt include/asm-s390/types.h
--- linux-2.6.17/include/asm-s390/types.h~sector_fmt	2006-06-28 16:46:28.537181493 -0700
+++ linux-2.6.17-ming/include/asm-s390/types.h	2006-06-28 16:46:28.555179428 -0700
@@ -89,6 +89,11 @@ typedef union {
 
 #ifdef CONFIG_LBD
 typedef u64 sector_t;
+#ifndef __s390x__
+#define SECTOR_FMT "%llu"
+#else
+#define SECTOR_FMT "%lu"
+#endif
 #define HAVE_SECTOR_T
 #endif
 
diff -puN include/asm-sh/types.h~sector_fmt include/asm-sh/types.h
--- linux-2.6.17/include/asm-sh/types.h~sector_fmt	2006-06-28 16:46:28.540181149 -0700
+++ linux-2.6.17-ming/include/asm-sh/types.h	2006-06-28 16:46:28.555179428 -0700
@@ -54,6 +54,7 @@ typedef u32 dma_addr_t;
 
 #ifdef CONFIG_LBD
 typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
 #define HAVE_SECTOR_T
 #endif
 
diff -puN include/asm-x86_64/types.h~sector_fmt include/asm-x86_64/types.h
--- linux-2.6.17/include/asm-x86_64/types.h~sector_fmt	2006-06-28 16:46:28.543180805 -0700
+++ linux-2.6.17-ming/include/asm-x86_64/types.h	2006-06-28 16:46:28.556179313 -0700
@@ -49,6 +49,7 @@ typedef u64 dma64_addr_t;
 typedef u64 dma_addr_t;
 
 typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
 #define HAVE_SECTOR_T
 
 #endif /* __ASSEMBLY__ */
diff -puN include/linux/types.h~sector_fmt include/linux/types.h
--- linux-2.6.17/include/linux/types.h~sector_fmt	2006-06-28 16:46:28.549180116 -0700
+++ linux-2.6.17-ming/include/linux/types.h	2006-06-28 16:46:28.557179199 -0700
@@ -134,6 +134,7 @@ typedef		__s64		int64_t;
  */
 #ifndef HAVE_SECTOR_T
 typedef unsigned long sector_t;
+#define SECTOR_FMT "%lu"
 #endif
 
 #ifndef HAVE_BLKCNT_T

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (5 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 2/16]sector_t type format string Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 4/16]support 48 bit blk number in extents Mingming Cao
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

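Redefine the ext3 in-kernel filesystem block type (ext3_fsblk_t) from
unsigned long to sector_t, so that the kernel can handle ext3 block
numbers wider than 32 bits. Open-coded block group and offset arithmetic
is replaced with a new helper, ext3_get_group_no_and_offset(), which uses
sector_div() for the division.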

Signed-Off-By: Mingming Cao <cmm@us.ibm.com>


---

 linux-2.6.17-ming/fs/ext3/balloc.c          |   22 ++++++++--------------
 linux-2.6.17-ming/fs/ext3/ialloc.c          |   11 +++++++----
 linux-2.6.17-ming/fs/ext3/resize.c          |   14 ++++++--------
 linux-2.6.17-ming/fs/ext3/super.c           |    8 ++++----
 linux-2.6.17-ming/include/linux/ext3_fs.h   |   26 ++++++++++++++++++++++++++
 linux-2.6.17-ming/include/linux/ext3_fs_i.h |    4 ++--
 6 files changed, 53 insertions(+), 32 deletions(-)
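
The group/offset calculation that this patch centralizes can be sketched
in userspace as follows. This is only an illustration under stated
assumptions: fsblk_sketch_t, grpblk_sketch_t, sector_div_sketch() and
group_no_and_offset_sketch() are hypothetical names, and the superblock
fields are passed in as plain parameters. The kernel's sector_div() is a
macro that divides its first argument in place and returns the
remainder; the stand-in below takes a pointer to get the same effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t fsblk_sketch_t;	/* stand-in for ext3_fsblk_t */
typedef int grpblk_sketch_t;		/* stand-in for ext3_grpblk_t */

/* Divide *n by base in place and return the remainder, mimicking the
 * in-place semantics of the kernel's sector_div(). */
static uint32_t sector_div_sketch(fsblk_sketch_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Split a filesystem-wide block number into (group, offset in group).
 * Either output pointer may be NULL if the caller does not need it. */
static void group_no_and_offset_sketch(fsblk_sketch_t blocknr,
				       uint32_t first_data_block,
				       uint32_t blocks_per_group,
				       unsigned long *groupp,
				       grpblk_sketch_t *offsetp)
{
	grpblk_sketch_t offset;

	blocknr -= first_data_block;
	offset = (grpblk_sketch_t)sector_div_sketch(&blocknr, blocks_per_group);
	if (offsetp)
		*offsetp = offset;
	if (groupp)
		*groupp = (unsigned long)blocknr;	/* quotient left in place */
}
```

With first_data_block = 1 and 8192 blocks per group, block 24594 falls in
group 3 at offset 17. Because the quotient is left in the dividend, a
single call yields both results, which is why the patch can drop the
separate "/" and "%" expressions.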

diff -puN fs/ext3/balloc.c~ext3_fsblk_sector_t fs/ext3/balloc.c
--- linux-2.6.17/fs/ext3/balloc.c~ext3_fsblk_sector_t	2006-06-28 16:46:36.057318618 -0700
+++ linux-2.6.17-ming/fs/ext3/balloc.c	2006-06-28 16:46:36.082315750 -0700
@@ -38,7 +38,6 @@
 
 
 #define in_range(b, first, len)	((b) >= (first) && (b) <= (first) + (len) - 1)
-
 struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
 					     unsigned int block_group,
 					     struct buffer_head ** bh)
@@ -340,10 +339,7 @@ void ext3_free_blocks_sb(handle_t *handl
 
 do_more:
 	overflow = 0;
-	block_group = (block - le32_to_cpu(es->s_first_data_block)) /
-		      EXT3_BLOCKS_PER_GROUP(sb);
-	bit = (block - le32_to_cpu(es->s_first_data_block)) %
-		      EXT3_BLOCKS_PER_GROUP(sb);
+	ext3_get_group_no_and_offset(sb, block, &block_group, &bit);
 	/*
 	 * Check to see if we are freeing blocks across a group
 	 * boundary.
@@ -1205,7 +1201,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 {
 	struct buffer_head *bitmap_bh = NULL;
 	struct buffer_head *gdp_bh;
-	int group_no;
+	unsigned long group_no;
 	int goal_group;
 	ext3_grpblk_t grp_target_blk;	/* blockgroup relative goal block */
 	ext3_grpblk_t grp_alloc_blk;	/* blockgroup-relative allocated block*/
@@ -1268,8 +1264,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	if (goal < le32_to_cpu(es->s_first_data_block) ||
 	    goal >= le32_to_cpu(es->s_blocks_count))
 		goal = le32_to_cpu(es->s_first_data_block);
-	group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
-			EXT3_BLOCKS_PER_GROUP(sb);
+	ext3_get_group_no_and_offset(sb, goal, &group_no, &grp_target_blk);
 	gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
 	if (!gdp)
 		goto io_error;
@@ -1286,8 +1281,6 @@ retry:
 		my_rsv = NULL;
 
 	if (free_blocks > 0) {
-		grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
-				EXT3_BLOCKS_PER_GROUP(sb));
 		bitmap_bh = read_block_bitmap(sb, group_no);
 		if (!bitmap_bh)
 			goto io_error;
@@ -1414,7 +1407,7 @@ allocated:
 	if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
 		ext3_error(sb, "ext3_new_block",
 			    "block("E3FSBLK") >= blocks count(%d) - "
-			    "block_group = %d, es == %p ", ret_block,
+			    "block_group = %lu, es == %p ", ret_block,
 			le32_to_cpu(es->s_blocks_count), group_no, es);
 		goto out;
 	}
@@ -1528,9 +1521,10 @@ ext3_fsblk_t ext3_count_free_blocks(stru
 static inline int
 block_in_use(ext3_fsblk_t block, struct super_block *sb, unsigned char *map)
 {
-	return ext3_test_bit ((block -
-		le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block)) %
-			 EXT3_BLOCKS_PER_GROUP(sb), map);
+	ext3_grpblk_t offset;
+
+	ext3_get_group_no_and_offset(sb, block, NULL, &offset);
+	return ext3_test_bit (offset, map);
 }
 
 static inline int test_root(int a, int b)
diff -puN fs/ext3/ialloc.c~ext3_fsblk_sector_t fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~ext3_fsblk_sector_t	2006-06-28 16:46:36.060318274 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c	2006-06-28 16:46:36.084315520 -0700
@@ -23,7 +23,7 @@
 #include <linux/buffer_head.h>
 #include <linux/random.h>
 #include <linux/bitops.h>
-
+#include <linux/blkdev.h>
 #include <asm/byteorder.h>
 
 #include "xattr.h"
@@ -274,7 +274,8 @@ static int find_group_orlov(struct super
 	freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
 	avefreei = freei / ngroups;
 	freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
-	avefreeb = freeb / ngroups;
+	avefreeb = freeb;
+	sector_div(avefreeb, ngroups);
 	ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
 
 	if ((parent == sb->s_root->d_inode) ||
@@ -303,13 +304,15 @@ static int find_group_orlov(struct super
 		goto fallback;
 	}
 
-	blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - freeb) / ndirs;
+	blocks_per_dir = le32_to_cpu(es->s_blocks_count) - freeb;
+	sector_div(blocks_per_dir, ndirs);
 
 	max_dirs = ndirs / ngroups + inodes_per_group / 16;
 	min_inodes = avefreei - inodes_per_group / 4;
 	min_blocks = avefreeb - EXT3_BLOCKS_PER_GROUP(sb) / 4;
 
-	max_debt = EXT3_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, (ext3_fsblk_t)BLOCK_COST);
+	max_debt = EXT3_BLOCKS_PER_GROUP(sb);
+	sector_div(max_debt, max(blocks_per_dir, (ext3_fsblk_t)BLOCK_COST));
 	if (max_debt * INODE_COST > inodes_per_group)
 		max_debt = inodes_per_group / INODE_COST;
 	if (max_debt > 255)
diff -puN fs/ext3/resize.c~ext3_fsblk_sector_t fs/ext3/resize.c
--- linux-2.6.17/fs/ext3/resize.c~ext3_fsblk_sector_t	2006-06-28 16:46:36.065317700 -0700
+++ linux-2.6.17-ming/fs/ext3/resize.c	2006-06-28 16:46:36.086315291 -0700
@@ -15,7 +15,6 @@
 #include <linux/sched.h>
 #include <linux/smp_lock.h>
 #include <linux/ext3_jbd.h>
-
 #include <linux/errno.h>
 #include <linux/slab.h>
 
@@ -37,7 +36,7 @@ static int verify_group_input(struct sup
 		 le16_to_cpu(es->s_reserved_gdt_blocks)) : 0;
 	ext3_fsblk_t metaend = start + overhead;
 	struct buffer_head *bh = NULL;
-	ext3_grpblk_t free_blocks_count;
+	ext3_grpblk_t free_blocks_count, offset;
 	int err = -EINVAL;
 
 	input->free_blocks_count = free_blocks_count =
@@ -50,13 +49,13 @@ static int verify_group_input(struct sup
 		       "no-super", input->group, input->blocks_count,
 		       free_blocks_count, input->reserved_blocks);
 
+	ext3_get_group_no_and_offset(sb, start, NULL, &offset);
 	if (group != sbi->s_groups_count)
 		ext3_warning(sb, __FUNCTION__,
 			     "Cannot add at group %u (only %lu groups)",
 			     input->group, sbi->s_groups_count);
-	else if ((start - le32_to_cpu(es->s_first_data_block)) %
-		 EXT3_BLOCKS_PER_GROUP(sb))
-		ext3_warning(sb, __FUNCTION__, "Last group not full");
+	else if (offset != 0)
+		ext3_warning(sb, __FUNCTION__, "Last group not full");
 	else if (input->reserved_blocks > input->blocks_count / 5)
 		ext3_warning(sb, __FUNCTION__, "Reserved blocks too high (%u)",
 			     input->reserved_blocks);
@@ -933,7 +932,7 @@ int ext3_group_extend(struct super_block
 
 	if (n_blocks_count > (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) {
 		printk(KERN_ERR "EXT3-fs: filesystem on %s:"
-			" too large to resize to %lu blocks safely\n",
+			" too large to resize to "E3FSBLK" blocks safely\n",
 			sb->s_id, n_blocks_count);
 		if (sizeof(sector_t) < 8)
 			ext3_warning(sb, __FUNCTION__,
@@ -948,8 +947,7 @@ int ext3_group_extend(struct super_block
 	}
 
 	/* Handle the remaining blocks in the last group only. */
-	last = (o_blocks_count - le32_to_cpu(es->s_first_data_block)) %
-		EXT3_BLOCKS_PER_GROUP(sb);
+	ext3_get_group_no_and_offset(sb, o_blocks_count, NULL, &last);
 
 	if (last == 0) {
 		ext3_warning(sb, __FUNCTION__,
diff -puN fs/ext3/super.c~ext3_fsblk_sector_t fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~ext3_fsblk_sector_t	2006-06-28 16:46:36.069317241 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c	2006-06-28 16:46:36.090314832 -0700
@@ -1388,8 +1388,8 @@ static int ext3_fill_super (struct super
 	 * block sizes.  We need to calculate the offset from buffer start.
 	 */
 	if (blocksize != EXT3_MIN_BLOCK_SIZE) {
-		logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
-		offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
+		logic_sb_block = sb_block * EXT3_MIN_BLOCK_SIZE;
+		offset = sector_div(logic_sb_block, blocksize);
 	} else {
 		logic_sb_block = sb_block;
 	}
@@ -1494,8 +1494,8 @@ static int ext3_fill_super (struct super
 
 		brelse (bh);
 		sb_set_blocksize(sb, blocksize);
-		logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
-		offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
+		logic_sb_block = sb_block * EXT3_MIN_BLOCK_SIZE;
+		offset = sector_div(logic_sb_block, blocksize);
 		bh = sb_bread(sb, logic_sb_block);
 		if (!bh) {
 			printk(KERN_ERR 
diff -puN include/linux/ext3_fs.h~ext3_fsblk_sector_t include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3_fsblk_sector_t	2006-06-28 16:46:36.073316783 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 16:46:36.092314603 -0700
@@ -17,6 +17,7 @@
 #define _LINUX_EXT3_FS_H
 
 #include <linux/types.h>
+#include <linux/blkdev.h>
 
 /*
  * The second extended filesystem constants/structures
@@ -728,6 +729,27 @@ ext3_group_first_block_no(struct super_b
 #define ERR_BAD_DX_DIR	-75000
 
 /*
+ * This function calculates the block group number and offset,
+ * given a block number
+ */
+
+static inline void ext3_get_group_no_and_offset(struct super_block * sb,
+                                ext3_fsblk_t blocknr, unsigned long* blockgrpp,
+                                ext3_grpblk_t *offsetp)
+{
+        struct ext3_super_block *es = EXT3_SB(sb)->s_es;
+	ext3_grpblk_t offset;
+
+        blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
+        offset = sector_div(blocknr, EXT3_BLOCKS_PER_GROUP(sb));
+	if (offsetp)
+		*offsetp = offset;
+	if (blockgrpp)
+	        *blockgrpp = blocknr;
+
+}
+
+/*
  * Function prototypes
  */
 
@@ -740,6 +762,10 @@ ext3_group_first_block_no(struct super_b
 # define NORET_AND     noreturn,
 
 /* balloc.c */
+extern unsigned int ext3_block_group(struct super_block *sb,
+			ext3_fsblk_t blocknr);
+extern ext3_grpblk_t ext3_block_group_offset(struct super_block *sb,
+			ext3_fsblk_t blocknr);
 extern int ext3_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
 extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
diff -puN include/linux/ext3_fs_i.h~ext3_fsblk_sector_t include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3_fsblk_sector_t	2006-06-28 16:46:36.077316324 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h	2006-06-28 16:46:36.093314488 -0700
@@ -25,9 +25,9 @@
 typedef int ext3_grpblk_t;
 
 /* data type for filesystem-wide blocks number */
-typedef unsigned long ext3_fsblk_t;
+typedef sector_t ext3_fsblk_t;
 
-#define E3FSBLK "%lu"
+#define E3FSBLK SECTOR_FMT
 
 struct ext3_reserve_window {
 	ext3_fsblk_t	_rsv_start;	/* First byte reserved */

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 4/16]support 48 bit blk number in extents
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (6 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 5/16]block type convert " Mingming Cao
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

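Add 48-bit physical block number support to extents. The low 32 bits are
kept in ee_start (ei_leaf for index entries) and the high 16 bits in
ee_start_hi (ei_leaf_hi); new helpers combine and split the two parts.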

Signed-Off-By: Alex Tomas <alex@clusterfs.com>


---

 linux-2.6.17-ming/fs/ext3/extents.c               |  138 +++++++++++++---------
 linux-2.6.17-ming/include/linux/ext3_fs_extents.h |    2 
 linux-2.6.17-ming/include/linux/ext3_fs_i.h       |    2 
 3 files changed, 87 insertions(+), 55 deletions(-)
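
The split and recombine step performed by the new helpers can be
sketched in userspace like this. It is a simplified illustration, not
the patch's code: struct extent_sketch, store_pblock_sketch() and
load_pblock_sketch() are hypothetical names, the fields are kept in host
byte order (the real code converts with cpu_to_le32/cpu_to_le16), and a
fixed 64-bit type is used so a plain shift by 32 is valid. The patch
itself writes the shift as "<< 31) << 1", presumably so the expression
remains well defined when sector_t is only 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, host-endian stand-in for the two on-disk fields. */
struct extent_sketch {
	uint32_t start_lo;	/* low 32 bits of the physical block */
	uint16_t start_hi;	/* high 16 bits of the physical block */
};

/* Split a block number into its 32-bit and 16-bit parts.
 * Bits above bit 47 are silently dropped. */
static void store_pblock_sketch(struct extent_sketch *ex, uint64_t pb)
{
	ex->start_lo = (uint32_t)(pb & 0xffffffffULL);
	ex->start_hi = (uint16_t)((pb >> 32) & 0xffff);
}

/* Recombine the two parts into one 48-bit block number. */
static uint64_t load_pblock_sketch(const struct extent_sketch *ex)
{
	return (uint64_t)ex->start_lo | ((uint64_t)ex->start_hi << 32);
}
```

Storing 0x123456789ABC leaves 0x1234 in the high field and 0x56789ABC in
the low field, and loading returns the original value. Any caller that
read only the low field, as the pre-patch code did with ee_start, would
see a truncated block number.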

diff -puN fs/ext3/extents.c~ext3-extents-48bit fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-extents-48bit	2006-06-28 16:46:39.848883567 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c	2006-06-28 16:46:39.863881846 -0700
@@ -44,6 +44,44 @@
 #include <asm/uaccess.h>
 
 
+/* this macro combines low and hi parts of phys. blocknr into sector_t */
+static inline sector_t ext_pblock(struct ext3_extent *ex)
+{
+	sector_t block;
+
+	block = le32_to_cpu(ex->ee_start);
+	if (sizeof(sector_t) > 4)
+		block |= ((sector_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
+	return block;
+}
+
+/* this macro combines low and hi parts of phys. blocknr into sector_t */
+static inline sector_t idx_pblock(struct ext3_extent_idx *ix)
+{
+	sector_t block;
+
+	block = le32_to_cpu(ix->ei_leaf);
+	if (sizeof(sector_t) > 4)
+		block |= ((sector_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+	return block;
+}
+
+/* the routine stores large phys. blocknr into extent breaking it into parts */
+static inline void ext3_ext_store_pblock(struct ext3_extent *ex, sector_t pb)
+{
+	ex->ee_start = cpu_to_le32((unsigned long) (pb & 0xffffffff));
+	if (sizeof(sector_t) > 4)
+		ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
+}
+
+/* the routine stores large phys. blocknr into index breaking it into parts */
+static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, sector_t pb)
+{
+	ix->ei_leaf = cpu_to_le32((unsigned long) (pb & 0xffffffff));
+	if (sizeof(sector_t) > 4)
+		ix->ei_leaf_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
+}
+
 static int ext3_ext_check_header(const char *function, struct inode *inode,
 				struct ext3_extent_header *eh)
 {
@@ -126,7 +164,7 @@ static int ext3_ext_dirty(handle_t *hand
 
 static int ext3_ext_find_goal(struct inode *inode,
 			      struct ext3_ext_path *path,
-			      unsigned long block)
+			      sector_t block)
 {
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	unsigned long bg_start;
@@ -139,8 +177,7 @@ static int ext3_ext_find_goal(struct ino
 
 		/* try to predict block placement */
 		if ((ex = path[depth].p_ext))
-			return le32_to_cpu(ex->ee_start)
-					+ (block - le32_to_cpu(ex->ee_block));
+			return ext_pblock(ex)+(block-le32_to_cpu(ex->ee_block));
 
 		/* it looks index is empty
 		 * try to find starting from index itself */
@@ -230,13 +267,13 @@ static void ext3_ext_show_path(struct in
 	ext_debug("path:");
 	for (k = 0; k <= l; k++, path++) {
 		if (path->p_idx) {
-		  ext_debug("  %d->%d", le32_to_cpu(path->p_idx->ei_block),
-			    le32_to_cpu(path->p_idx->ei_leaf));
+		  ext_debug("  %d->%llu", le32_to_cpu(path->p_idx->ei_block),
+			    idx_pblock(path->p_idx));
 		} else if (path->p_ext) {
-			ext_debug("  %d:%d:%d",
+			ext_debug("  %d:%d:%lld",
 				  le32_to_cpu(path->p_ext->ee_block),
 				  le16_to_cpu(path->p_ext->ee_len),
-				  le32_to_cpu(path->p_ext->ee_start));
+				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
 	}
@@ -257,9 +294,8 @@ static void ext3_ext_show_leaf(struct in
 	ex = EXT_FIRST_EXTENT(eh);
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
-		ext_debug("%d:%d:%d ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len),
-			  le32_to_cpu(ex->ee_start));
+		ext_debug("%d:%d:%lld ", le32_to_cpu(ex->ee_block),
+			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -308,8 +344,8 @@ ext3_ext_binsearch_idx(struct inode *ino
 	}
 
 	path->p_idx = l - 1;
-	ext_debug("  -> %d->%d ", le32_to_cpu(path->p_idx->ei_block),
-		  le32_to_cpu(path->p_idx->ei_leaf));
+	ext_debug("  -> %d->%lld ", le32_to_cpu(path->p_idx->ei_block),
+		  idx_pblock(path->p_idx));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -374,10 +410,10 @@ ext3_ext_binsearch(struct inode *inode, 
 	}
 
 	path->p_ext = l - 1;
-	ext_debug("  -> %d:%d:%d ",
+	ext_debug("  -> %d:%lld:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
-		        le32_to_cpu(path->p_ext->ee_start),
-		        le16_to_cpu(path->p_ext->ee_len));
+		        ext_pblock(path->p_ext),
+			le16_to_cpu(path->p_ext->ee_len));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -442,7 +478,7 @@ ext3_ext_find_extent(struct inode *inode
 		ext_debug("depth %d: num %d, max %d\n",
 			  ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
 		ext3_ext_binsearch_idx(inode, path + ppos, block);
-		path[ppos].p_block = le32_to_cpu(path[ppos].p_idx->ei_leaf);
+		path[ppos].p_block = idx_pblock(path[ppos].p_idx);
 		path[ppos].p_depth = i;
 		path[ppos].p_ext = NULL;
 
@@ -524,7 +560,7 @@ static int ext3_ext_insert_index(handle_
 	}
 
 	ix->ei_block = cpu_to_le32(logical);
-	ix->ei_leaf = cpu_to_le32(ptr);
+	ext3_idx_store_pblock(ix, ptr);
 	curp->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(curp->p_hdr->eh_entries)+1);
 
 	BUG_ON(le16_to_cpu(curp->p_hdr->eh_entries)
@@ -633,9 +669,9 @@ static int ext3_ext_split(handle_t *hand
 	path[depth].p_ext++;
 	while (path[depth].p_ext <=
 			EXT_MAX_EXTENT(path[depth].p_hdr)) {
-		ext_debug("move %d:%d:%d in new leaf %lu\n",
+		ext_debug("move %d:%lld:%d in new leaf %lu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
-			        le32_to_cpu(path[depth].p_ext->ee_start),
+			        ext_pblock(path[depth].p_ext),
 			        le16_to_cpu(path[depth].p_ext->ee_len),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
@@ -696,7 +732,7 @@ static int ext3_ext_split(handle_t *hand
 		neh->eh_depth = cpu_to_le16(depth - i);
 		fidx = EXT_FIRST_INDEX(neh);
 		fidx->ei_block = border;
-		fidx->ei_leaf = cpu_to_le32(oldblock);
+		ext3_idx_store_pblock(fidx, oldblock);
 
 		ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
 				newblock, (unsigned long) le32_to_cpu(border),
@@ -710,9 +746,9 @@ static int ext3_ext_split(handle_t *hand
 		BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
 				EXT_LAST_INDEX(path[i].p_hdr));
 		while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
-			ext_debug("%d: move %d:%d in new index %lu\n", i,
+			ext_debug("%d: move %d:%d in new index %llu\n", i,
 				        le32_to_cpu(path[i].p_idx->ei_block),
-				        le32_to_cpu(path[i].p_idx->ei_leaf),
+				        idx_pblock(path[i].p_idx),
 				        newblock);
 			/*memmove(++fidx, path[i].p_idx++,
 					sizeof(struct ext3_extent_idx));
@@ -839,13 +875,13 @@ static int ext3_ext_grow_indepth(handle_
 	curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
 	/* FIXME: it works, but actually path[0] can be index */
 	curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
-	curp->p_idx->ei_leaf = cpu_to_le32(newblock);
+	ext3_idx_store_pblock(curp->p_idx, newblock);
 
 	neh = ext_inode_hdr(inode);
 	fidx = EXT_FIRST_INDEX(neh);
-	ext_debug("new root: num %d(%d), lblock %d, ptr %d\n",
+	ext_debug("new root: num %d(%d), lblock %d, ptr %llu\n",
 		  le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
-		  le32_to_cpu(fidx->ei_block), le32_to_cpu(fidx->ei_leaf));
+		  le32_to_cpu(fidx->ei_block), idx_pblock(fidx));
 
 	neh->eh_depth = cpu_to_le16(path->p_depth + 1);
 	err = ext3_ext_dirty(handle, inode, curp);
@@ -1042,7 +1078,6 @@ static int inline
 ext3_can_extents_be_merged(struct inode *inode, struct ext3_extent *ex1,
 				struct ext3_extent *ex2)
 {
-	/* FIXME: 48bit support */
         if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len)
 	    != le32_to_cpu(ex2->ee_block))
 		return 0;
@@ -1052,8 +1087,7 @@ ext3_can_extents_be_merged(struct inode 
 		return 0;
 #endif
 
-        if (le32_to_cpu(ex1->ee_start) + le16_to_cpu(ex1->ee_len)
-	    		== le32_to_cpu(ex2->ee_start))
+        if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1080,11 +1114,10 @@ int ext3_ext_insert_extent(handle_t *han
 
 	/* try to insert block into found extent and return */
 	if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
-		ext_debug("append %d block to %d:%d (from %d)\n",
+		ext_debug("append %d block to %d:%d (from %lld)\n",
 				le16_to_cpu(newext->ee_len),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len),
-				le32_to_cpu(ex->ee_start));
+				le16_to_cpu(ex->ee_len), ext_pblock(ex));
 		if ((err = ext3_ext_get_access(handle, inode, path + depth)))
 			return err;
 		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
@@ -1140,9 +1173,9 @@ has_space:
 
 	if (!nearex) {
 		/* there is no extent in this leaf, create first one */
-		ext_debug("first extent in the leaf: %d:%d:%d\n",
+		ext_debug("first extent in the leaf: %d:%lld:%d\n",
 			        le32_to_cpu(newext->ee_block),
-			        le32_to_cpu(newext->ee_start),
+			        ext_pblock(newext),
 			        le16_to_cpu(newext->ee_len));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
@@ -1152,10 +1185,10 @@ has_space:
 			len = EXT_MAX_EXTENT(eh) - nearex;
 			len = (len - 1) * sizeof(struct ext3_extent);
 			len = len < 0 ? 0 : len;
-			ext_debug("insert %d:%d:%d after: nearest 0x%p, "
+			ext_debug("insert %d:%lld:%d after: nearest 0x%p, "
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
-				        le32_to_cpu(newext->ee_start),
+				        ext_pblock(newext),
 				        le16_to_cpu(newext->ee_len),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
@@ -1165,10 +1198,10 @@ has_space:
  		BUG_ON(newext->ee_block == nearex->ee_block);
 		len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
 		len = len < 0 ? 0 : len;
-		ext_debug("insert %d:%d:%d before: nearest 0x%p, "
+		ext_debug("insert %d:%lld:%d before: nearest 0x%p, "
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
-				le32_to_cpu(newext->ee_start),
+				ext_pblock(newext),
 				le16_to_cpu(newext->ee_len),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
@@ -1179,9 +1212,8 @@ has_space:
 	nearex = path[depth].p_ext;
 	nearex->ee_block = newext->ee_block;
 	nearex->ee_start = newext->ee_start;
+	nearex->ee_start_hi = newext->ee_start_hi;
 	nearex->ee_len = newext->ee_len;
-	/* FIXME: support for large fs */
-	nearex->ee_start_hi = 0;
 
 merge:
 	/* try to merge extents to the right */
@@ -1290,7 +1322,7 @@ int ext3_ext_walk_space(struct inode *in
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
 		        cbex.ec_len = le16_to_cpu(ex->ee_len);
-		        cbex.ec_start = le32_to_cpu(ex->ee_start);
+		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT3_EXT_CACHE_EXTENT;
 		}
 
@@ -1398,7 +1430,7 @@ ext3_ext_in_cache(struct inode *inode, u
 			cex->ec_type != EXT3_EXT_CACHE_EXTENT);
 	if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) {
 	        ex->ee_block = cpu_to_le32(cex->ec_block);
-	        ex->ee_start = cpu_to_le32(cex->ec_start);
+		ext3_ext_store_pblock(ex, cex->ec_start);
 	        ex->ee_len = cpu_to_le16(cex->ec_len);
 		ext_debug("%lu cached by %lu:%lu:%lu\n",
 				(unsigned long) block,
@@ -1426,7 +1458,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
 
 	/* free index block */
 	path--;
-	leaf = le32_to_cpu(path->p_idx->ei_leaf);
+	leaf = idx_pblock(path->p_idx);
 	BUG_ON(path->p_hdr->eh_entries == 0);
 	if ((err = ext3_ext_get_access(handle, inode, path)))
 		return err;
@@ -1517,7 +1549,7 @@ static int ext3_remove_blocks(handle_t *
 		/* tail removal */
 		unsigned long num, start;
 		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = le32_to_cpu(ex->ee_start) + le16_to_cpu(ex->ee_len) - num;
+		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
 		ext_debug("free last %lu blocks starting %lu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1621,7 +1653,7 @@ ext3_ext_rm_leaf(handle_t *handle, struc
 
 		if (num == 0) {
 			/* this extent is removed entirely mark slot unused */
-			ex->ee_start = 0;
+			ext3_ext_store_pblock(ex, 0);
 			eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
 		}
 
@@ -1632,8 +1664,8 @@ ext3_ext_rm_leaf(handle_t *handle, struc
 		if (err)
 			goto out;
 
-		ext_debug("new extent: %u:%u:%u\n", block, num,
-				le32_to_cpu(ex->ee_start));
+		ext_debug("new extent: %u:%u:%llu\n", block, num,
+				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
 		ex_ee_len = le16_to_cpu(ex->ee_len);
@@ -1748,11 +1780,11 @@ int ext3_ext_remove_space(struct inode *
 				path[i].p_idx);
 		if (ext3_ext_more_to_rm(path + i)) {
 			/* go to the next level */
-			ext_debug("move to level %d (block %d)\n",
-				  i + 1, le32_to_cpu(path[i].p_idx->ei_leaf));
+			ext_debug("move to level %d (block %llu)\n",
+				  i + 1, idx_pblock(path[i].p_idx));
 			memset(path + i + 1, 0, sizeof(*path));
 			path[i+1].p_bh =
-				sb_bread(sb, le32_to_cpu(path[i].p_idx->ei_leaf));
+				sb_bread(sb, idx_pblock(path[i].p_idx));
 			if (!path[i+1].p_bh) {
 				/* should we reset i_size? */
 				err = -EIO;
@@ -1878,7 +1910,7 @@ int ext3_ext_get_blocks(handle_t *handle
 			/* block is already allocated */
 		        newblock = iblock
 		                   - le32_to_cpu(newex.ee_block)
-			           + le32_to_cpu(newex.ee_start);
+			           + ext_pblock(&newex);
 			/* number of remain blocks in the extent */
 			allocated = le16_to_cpu(newex.ee_len) -
 					(iblock - le32_to_cpu(newex.ee_block));
@@ -1907,7 +1939,7 @@ int ext3_ext_get_blocks(handle_t *handle
 
 	if ((ex = path[depth].p_ext)) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
-		unsigned long ee_start = le32_to_cpu(ex->ee_start);
+		unsigned long ee_start = ext_pblock(ex);
 		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
 		/* if found exent covers block, simple return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -1943,7 +1975,7 @@ int ext3_ext_get_blocks(handle_t *handle
 
 	/* try to insert new extent into found leaf and return */
 	newex.ee_block = cpu_to_le32(iblock);
-	newex.ee_start = cpu_to_le32(newblock);
+	ext3_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext3_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
@@ -1953,7 +1985,7 @@ int ext3_ext_get_blocks(handle_t *handle
 		EXT3_I(inode)->i_disksize = inode->i_size;
 
 	/* previous routine could use block we allocated */
-	newblock = le32_to_cpu(newex.ee_start);
+	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
 	ext3_ext_put_in_cache(inode, iblock, allocated, newblock,
diff -puN include/linux/ext3_fs_extents.h~ext3-extents-48bit include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-extents-48bit	2006-06-28 16:46:39.851883223 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h	2006-06-28 16:46:39.864881731 -0700
@@ -108,7 +108,7 @@ struct ext3_extent_header {
  * truncate uses it to simulate recursive walking
  */
 struct ext3_ext_path {
-	__u32				p_block;
+	__u64				p_block;
 	__u16				p_depth;
 	struct ext3_extent		*p_ext;
 	struct ext3_extent_idx		*p_idx;
diff -puN include/linux/ext3_fs_i.h~ext3-extents-48bit include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents-48bit	2006-06-28 16:46:39.855882764 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h	2006-06-28 16:46:39.864881731 -0700
@@ -68,7 +68,7 @@ struct ext3_block_alloc_info {
  * storage for cached extent
  */
 struct ext3_ext_cache {
-	__u32	ec_start;
+	sector_t ec_start;
 	__u32	ec_block;
 	__u32	ec_len; /* must be 32bit to return holes */
 	__u32	ec_type;

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 5/16]block type convert in extents
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (7 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 4/16]support 48 bit blk number in extents Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 6/16]handing unitialized extents Mingming Cao
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Convert the in-kernel filesystem block type used by the extents code to
ext3_fsblk_t.

Signed-Off-By: Avantika Mathur <mathur@us.ibm.com> 
Acked-By: Alex Tomas <alex@us.ibm.com>


---

 linux-2.6.17-ming/fs/ext3/extents.c               |  106 +++++++++++-----------
 linux-2.6.17-ming/include/linux/ext3_fs_extents.h |    2 
 linux-2.6.17-ming/include/linux/ext3_fs_i.h       |    8 -
 3 files changed, 59 insertions(+), 57 deletions(-)

diff -puN fs/ext3/extents.c~ext3-extents-ext3_fsblk_t fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-extents-ext3_fsblk_t	2006-06-28 16:46:45.589224909 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c	2006-06-28 16:46:45.603223303 -0700
@@ -44,41 +44,41 @@
 #include <asm/uaccess.h>
 
 
-/* this macro combines low and hi parts of phys. blocknr into sector_t */
-static inline sector_t ext_pblock(struct ext3_extent *ex)
+/* this macro combines low and hi parts of phys. blocknr into ext3_fsblk_t */
+static inline ext3_fsblk_t ext_pblock(struct ext3_extent *ex)
 {
-	sector_t block;
+	ext3_fsblk_t block;
 
 	block = le32_to_cpu(ex->ee_start);
-	if (sizeof(sector_t) > 4)
-		block |= ((sector_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
+	if (sizeof(ext3_fsblk_t) > 4)
+		block |= ((ext3_fsblk_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
 	return block;
 }
 
-/* this macro combines low and hi parts of phys. blocknr into sector_t */
-static inline sector_t idx_pblock(struct ext3_extent_idx *ix)
+/* this macro combines low and hi parts of phys. blocknr into ext3_fsblk_t */
+static inline ext3_fsblk_t idx_pblock(struct ext3_extent_idx *ix)
 {
-	sector_t block;
+	ext3_fsblk_t block;
 
 	block = le32_to_cpu(ix->ei_leaf);
-	if (sizeof(sector_t) > 4)
-		block |= ((sector_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+	if (sizeof(ext3_fsblk_t) > 4)
+		block |= ((ext3_fsblk_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
 	return block;
 }
 
 /* the routine stores large phys. blocknr into extent breaking it into parts */
-static inline void ext3_ext_store_pblock(struct ext3_extent *ex, sector_t pb)
+static inline void ext3_ext_store_pblock(struct ext3_extent *ex, ext3_fsblk_t pb)
 {
 	ex->ee_start = cpu_to_le32((unsigned long) (pb & 0xffffffff));
-	if (sizeof(sector_t) > 4)
+	if (sizeof(ext3_fsblk_t) > 4)
 		ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
 }
 
 /* the routine stores large phys. blocknr into index breaking it into parts */
-static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, sector_t pb)
+static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, ext3_fsblk_t pb)
 {
 	ix->ei_leaf = cpu_to_le32((unsigned long) (pb & 0xffffffff));
-	if (sizeof(sector_t) > 4)
+	if (sizeof(ext3_fsblk_t) > 4)
 		ix->ei_leaf_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
 }
 
@@ -162,13 +162,13 @@ static int ext3_ext_dirty(handle_t *hand
 	return err;
 }
 
-static int ext3_ext_find_goal(struct inode *inode,
+static ext3_fsblk_t ext3_ext_find_goal(struct inode *inode,
 			      struct ext3_ext_path *path,
-			      sector_t block)
+			      ext3_fsblk_t block)
 {
 	struct ext3_inode_info *ei = EXT3_I(inode);
-	unsigned long bg_start;
-	unsigned long colour;
+	ext3_fsblk_t bg_start;
+	ext3_grpblk_t colour;
 	int depth;
 
 	if (path) {
@@ -193,12 +193,12 @@ static int ext3_ext_find_goal(struct ino
 	return bg_start + colour + block;
 }
 
-static int
+static ext3_fsblk_t
 ext3_ext_new_block(handle_t *handle, struct inode *inode,
 			struct ext3_ext_path *path,
 			struct ext3_extent *ex, int *err)
 {
-	int goal, newblock;
+	ext3_fsblk_t goal, newblock;
 
 	goal = ext3_ext_find_goal(inode, path, le32_to_cpu(ex->ee_block));
 	newblock = ext3_new_block(handle, inode, goal, err);
@@ -267,10 +267,10 @@ static void ext3_ext_show_path(struct in
 	ext_debug("path:");
 	for (k = 0; k <= l; k++, path++) {
 		if (path->p_idx) {
-		  ext_debug("  %d->%llu", le32_to_cpu(path->p_idx->ei_block),
+		  ext_debug("  %d->"E3FSBLK, le32_to_cpu(path->p_idx->ei_block),
 			    idx_pblock(path->p_idx));
 		} else if (path->p_ext) {
-			ext_debug("  %d:%d:%lld",
+			ext_debug("  %d:%d:"E3FSBLK" ",
 				  le32_to_cpu(path->p_ext->ee_block),
 				  le16_to_cpu(path->p_ext->ee_len),
 				  ext_pblock(path->p_ext));
@@ -294,7 +294,7 @@ static void ext3_ext_show_leaf(struct in
 	ex = EXT_FIRST_EXTENT(eh);
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
-		ext_debug("%d:%d:%lld ", le32_to_cpu(ex->ee_block),
+		ext_debug("%d:%d:"E3FSBLK" ", le32_to_cpu(ex->ee_block),
 			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
 	}
 	ext_debug("\n");
@@ -410,7 +410,7 @@ ext3_ext_binsearch(struct inode *inode, 
 	}
 
 	path->p_ext = l - 1;
-	ext_debug("  -> %d:%lld:%d ",
+	ext_debug("  -> %d:"E3FSBLK":%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
 			le16_to_cpu(path->p_ext->ee_len));
@@ -525,7 +525,7 @@ err:
  */
 static int ext3_ext_insert_index(handle_t *handle, struct inode *inode,
 				struct ext3_ext_path *curp,
-				int logical, int ptr)
+				int logical, ext3_fsblk_t ptr)
 {
 	struct ext3_extent_idx *ix;
 	int len, err;
@@ -592,9 +592,9 @@ static int ext3_ext_split(handle_t *hand
 	struct ext3_extent_idx *fidx;
 	struct ext3_extent *ex;
 	int i = at, k, m, a;
-	unsigned long newblock, oldblock;
+	ext3_fsblk_t newblock, oldblock;
 	__le32 border;
-	int *ablocks = NULL; /* array of allocated blocks */
+	ext3_fsblk_t *ablocks = NULL; /* array of allocated blocks */
 	int err = 0;
 
 	/* make decision: where to split? */
@@ -627,10 +627,10 @@ static int ext3_ext_split(handle_t *hand
 	 * we need this to handle errors and free blocks
 	 * upon them
 	 */
-	ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS);
+	ablocks = kmalloc(sizeof(ext3_fsblk_t) * depth, GFP_NOFS);
 	if (!ablocks)
 		return -ENOMEM;
-	memset(ablocks, 0, sizeof(unsigned long) * depth);
+	memset(ablocks, 0, sizeof(ext3_fsblk_t) * depth);
 
 	/* allocate all needed blocks */
 	ext_debug("allocate %d blocks for indexes/leaf\n", depth - at);
@@ -669,7 +669,7 @@ static int ext3_ext_split(handle_t *hand
 	path[depth].p_ext++;
 	while (path[depth].p_ext <=
 			EXT_MAX_EXTENT(path[depth].p_hdr)) {
-		ext_debug("move %d:%lld:%d in new leaf %lu\n",
+		ext_debug("move %d:"E3FSBLK":%d in new leaf "E3FSBLK"\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
 			        le16_to_cpu(path[depth].p_ext->ee_len),
@@ -715,7 +715,7 @@ static int ext3_ext_split(handle_t *hand
 	while (k--) {
 		oldblock = newblock;
 		newblock = ablocks[--a];
-		bh = sb_getblk(inode->i_sb, newblock);
+		bh = sb_getblk(inode->i_sb, (ext3_fsblk_t)newblock);
 		if (!bh) {
 			err = -EIO;
 			goto cleanup;
@@ -734,7 +734,7 @@ static int ext3_ext_split(handle_t *hand
 		fidx->ei_block = border;
 		ext3_idx_store_pblock(fidx, oldblock);
 
-		ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
+		ext_debug("int.index at %d (block "E3FSBLK"): %lu -> "E3FSBLK"\n", i,
 				newblock, (unsigned long) le32_to_cpu(border),
 			  	oldblock);
 		/* copy indexes */
@@ -746,7 +746,7 @@ static int ext3_ext_split(handle_t *hand
 		BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
 				EXT_LAST_INDEX(path[i].p_hdr));
 		while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
-			ext_debug("%d: move %d:%d in new index %llu\n", i,
+			ext_debug("%d: move %d:%d in new index "E3FSBLK"\n", i,
 				        le32_to_cpu(path[i].p_idx->ei_block),
 				        idx_pblock(path[i].p_idx),
 				        newblock);
@@ -827,7 +827,7 @@ static int ext3_ext_grow_indepth(handle_
 	struct ext3_extent_header *neh;
 	struct ext3_extent_idx *fidx;
 	struct buffer_head *bh;
-	unsigned long newblock;
+	ext3_fsblk_t newblock;
 	int err = 0;
 
 	newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
@@ -879,7 +879,7 @@ static int ext3_ext_grow_indepth(handle_
 
 	neh = ext_inode_hdr(inode);
 	fidx = EXT_FIRST_INDEX(neh);
-	ext_debug("new root: num %d(%d), lblock %d, ptr %llu\n",
+	ext_debug("new root: num %d(%d), lblock %d, ptr "E3FSBLK"\n",
 		  le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
 		  le32_to_cpu(fidx->ei_block), idx_pblock(fidx));
 
@@ -1114,7 +1114,7 @@ int ext3_ext_insert_extent(handle_t *han
 
 	/* try to insert block into found extent and return */
 	if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
-		ext_debug("append %d block to %d:%d (from %lld)\n",
+		ext_debug("append %d block to %d:%d (from "E3FSBLK")\n",
 				le16_to_cpu(newext->ee_len),
 				le32_to_cpu(ex->ee_block),
 				le16_to_cpu(ex->ee_len), ext_pblock(ex));
@@ -1173,7 +1173,7 @@ has_space:
 
 	if (!nearex) {
 		/* there is no extent in this leaf, create first one */
-		ext_debug("first extent in the leaf: %d:%lld:%d\n",
+		ext_debug("first extent in the leaf: %d:"E3FSBLK":%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
 			        le16_to_cpu(newext->ee_len));
@@ -1185,7 +1185,7 @@ has_space:
 			len = EXT_MAX_EXTENT(eh) - nearex;
 			len = (len - 1) * sizeof(struct ext3_extent);
 			len = len < 0 ? 0 : len;
-			ext_debug("insert %d:%lld:%d after: nearest 0x%p, "
+			ext_debug("insert %d:"E3FSBLK":%d after: nearest 0x%p, "
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
@@ -1198,7 +1198,7 @@ has_space:
  		BUG_ON(newext->ee_block == nearex->ee_block);
 		len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
 		len = len < 0 ? 0 : len;
-		ext_debug("insert %d:%lld:%d before: nearest 0x%p, "
+		ext_debug("insert %d:"E3FSBLK":%d before: nearest 0x%p, "
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
@@ -1432,11 +1432,11 @@ ext3_ext_in_cache(struct inode *inode, u
 	        ex->ee_block = cpu_to_le32(cex->ec_block);
 		ext3_ext_store_pblock(ex, cex->ec_start);
 	        ex->ee_len = cpu_to_le16(cex->ec_len);
-		ext_debug("%lu cached by %lu:%lu:%lu\n",
+		ext_debug("%lu cached by %lu:%lu:"E3FSBLK"\n",
 				(unsigned long) block,
 				(unsigned long) cex->ec_block,
 				(unsigned long) cex->ec_len,
-				(unsigned long) cex->ec_start);
+				cex->ec_start);
 		return cex->ec_type;
 	}
 
@@ -1454,7 +1454,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
 {
 	struct buffer_head *bh;
 	int err;
-	unsigned long leaf;
+	ext3_fsblk_t leaf;
 
 	/* free index block */
 	path--;
@@ -1465,7 +1465,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
 	path->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path->p_hdr->eh_entries)-1);
 	if ((err = ext3_ext_dirty(handle, inode, path)))
 		return err;
-	ext_debug("index is empty, remove it, free block %lu\n", leaf);
+	ext_debug("index is empty, remove it, free block "E3FSBLK"\n", leaf);
 	bh = sb_find_get_block(inode->i_sb, leaf);
 	ext3_forget(handle, 1, inode, bh, leaf);
 	ext3_free_blocks(handle, inode, leaf, 1);
@@ -1547,10 +1547,11 @@ static int ext3_remove_blocks(handle_t *
 	if (from >= le32_to_cpu(ex->ee_block)
 	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
 		/* tail removal */
-		unsigned long num, start;
+		unsigned long num;
+		ext3_fsblk_t start;
 		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
 		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
-		ext_debug("free last %lu blocks starting %lu\n", num, start);
+		ext_debug("free last %lu blocks starting "E3FSBLK"\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
 			ext3_forget(handle, 0, inode, bh, start + i);
@@ -1664,7 +1665,7 @@ ext3_ext_rm_leaf(handle_t *handle, struc
 		if (err)
 			goto out;
 
-		ext_debug("new extent: %u:%u:%llu\n", block, num,
+		ext_debug("new extent: %u:%u:"E3FSBLK"\n", block, num,
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
@@ -1780,7 +1781,7 @@ int ext3_ext_remove_space(struct inode *
 				path[i].p_idx);
 		if (ext3_ext_more_to_rm(path + i)) {
 			/* go to the next level */
-			ext_debug("move to level %d (block %llu)\n",
+			ext_debug("move to level %d (block "E3FSBLK")\n",
 				  i + 1, idx_pblock(path[i].p_idx));
 			memset(path + i + 1, 0, sizeof(*path));
 			path[i+1].p_bh =
@@ -1883,13 +1884,14 @@ void ext3_ext_release(struct super_block
 #endif
 }
 
-int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, sector_t iblock,
+int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, ext3_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext3_ext_path *path = NULL;
 	struct ext3_extent newex, *ex;
-	int goal, newblock, err = 0, depth;
+	ext3_fsblk_t goal, newblock;
+	int err = 0, depth;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -1939,14 +1941,14 @@ int ext3_ext_get_blocks(handle_t *handle
 
 	if ((ex = path[depth].p_ext)) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
-		unsigned long ee_start = ext_pblock(ex);
+		ext3_fsblk_t ee_start = ext_pblock(ex);
 		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
 		/* if found exent covers block, simple return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
 			/* number of remain blocks in the extent */
 			allocated = ee_len - (iblock - ee_block);
-			ext_debug("%d fit into %lu:%d -> %d\n", (int) iblock,
+			ext_debug("%d fit into %lu:%d -> "E3FSBLK"\n", (int) iblock,
 					ee_block, ee_len, newblock);
 			ext3_ext_put_in_cache(inode, ee_block, ee_len,
 						ee_start, EXT3_EXT_CACHE_EXTENT);
@@ -1970,7 +1972,7 @@ int ext3_ext_get_blocks(handle_t *handle
 	newblock = ext3_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
-	ext_debug("allocate new block: goal %d, found %d/%lu\n",
+	ext_debug("allocate new block: goal "E3FSBLK", found "E3FSBLK"/%lu\n",
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
diff -puN include/linux/ext3_fs_extents.h~ext3-extents-ext3_fsblk_t include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-extents-ext3_fsblk_t	2006-06-28 16:46:45.592224565 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h	2006-06-28 16:46:45.604223188 -0700
@@ -108,7 +108,7 @@ struct ext3_extent_header {
  * truncate uses it to simulate recursive walking
  */
 struct ext3_ext_path {
-	__u64				p_block;
+	ext3_fsblk_t			p_block;
 	__u16				p_depth;
 	struct ext3_extent		*p_ext;
 	struct ext3_extent_idx		*p_idx;
diff -puN include/linux/ext3_fs_i.h~ext3-extents-ext3_fsblk_t include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents-ext3_fsblk_t	2006-06-28 16:46:45.596224106 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h	2006-06-28 16:46:45.604223188 -0700
@@ -68,10 +68,10 @@ struct ext3_block_alloc_info {
  * storage for cached extent
  */
 struct ext3_ext_cache {
-	sector_t ec_start;
-	__u32	ec_block;
-	__u32	ec_len; /* must be 32bit to return holes */
-	__u32	ec_type;
+	ext3_fsblk_t	ec_start;
+	__u32		ec_block;
+	__u32		ec_len; /* must be 32bit to return holes */
+	__u32		ec_type;
 };
 
 /*

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 6/16]handing unitialized extents
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (8 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 5/16]block type convert " Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:17 ` [RFC][Update][Patch 7/16]Core 64 bit JBD changes Mingming Cao
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Make it possible to add file preallocation support in future as an
RO_COMPAT feature by recognizing uninitialized extents as holes and
limiting extent length to keep the top bit of ee_len free for marking
uninitialized extents.

Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
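
A minimal userspace sketch of the length rule described above, not the
kernel code itself: DEMO_MAX_LEN repeats the EXT_MAX_LEN value from this
patch, while demo_can_merge_len and demo_len_is_reserved are hypothetical
helper names. Both inputs to the merge check are assumed to be ordinary
(initialized) extent lengths:

```c
#include <assert.h>
#include <stdint.h>

/* Same value as EXT_MAX_LEN in this patch: 32767, i.e. bit 15 clear. */
#define DEMO_MAX_LEN	((1UL << 15) - 1)

/*
 * Return 1 if two adjacent extents with these lengths may be merged
 * without the combined length setting the top bit of a 16-bit field.
 */
static int demo_can_merge_len(uint16_t len1, uint16_t len2)
{
	assert(len1 <= DEMO_MAX_LEN && len2 <= DEMO_MAX_LEN);
	return (unsigned long)len1 + len2 <= DEMO_MAX_LEN;
}

/* Return 1 if a raw 16-bit length has the reserved top bit set. */
static int demo_len_is_reserved(uint16_t raw_len)
{
	return raw_len > DEMO_MAX_LEN;
}
```

With these helpers, lengths 16000 and 16767 may be merged (the sum is
exactly 32767), while 16000 and 16768 may not, and a raw length of 0x8000
is reported as reserved.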


---

 linux-2.6.17-ming/fs/ext3/extents.c               |   16 ++++++++++++++++
 linux-2.6.17-ming/include/linux/ext3_fs_extents.h |    2 ++
 2 files changed, 18 insertions(+)

diff -puN fs/ext3/extents.c~ext3-unitialized-extent-handling fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-unitialized-extent-handling	2006-06-28 16:46:49.657758078 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c	2006-06-28 16:46:49.667756930 -0700
@@ -1082,6 +1082,13 @@ ext3_can_extents_be_merged(struct inode 
 	    != le32_to_cpu(ex2->ee_block))
 		return 0;
 
+	/*
+	 * To allow future support for preallocated extents to be added
+	 * as an RO_COMPAT feature, refuse to merge two extents if
+	 * that can result in the top bit of ee_len being set
+	 */
+	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+		return 0;
 #ifdef AGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
@@ -1943,6 +1950,15 @@ int ext3_ext_get_blocks(handle_t *handle
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext3_fsblk_t ee_start = ext_pblock(ex);
 		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+
+		/*
+		 * Allow future support for preallocated extents to be added
+		 * as an RO_COMPAT feature:
+		 * Uninitialized extents are treated as holes, except that
+		 * we avoid (fail) allocating new blocks during a write.
+		 */
+		if (ee_len > EXT_MAX_LEN)
+			goto out2;
 		/* if found exent covers block, simple return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
diff -puN include/linux/ext3_fs_extents.h~ext3-unitialized-extent-handling include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-unitialized-extent-handling	2006-06-28 16:46:49.661757619 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h	2006-06-28 16:46:49.668756816 -0700
@@ -141,6 +141,8 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK	0xffffffff
 
+#define EXT_MAX_LEN	((1UL << 15) - 1)
+
 
 #define EXT_FIRST_EXTENT(__hdr__) \
 	((struct ext3_extent *) (((char *) (__hdr__)) +		\

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 7/16]Core 64 bit JBD changes
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (9 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 6/16]handing unitialized extents Mingming Cao
@ 2006-06-30  0:17 ` Mingming Cao
  2006-06-30  0:18 ` [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags Mingming Cao
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Here is the patch to JBD to handle 64-bit block numbers, originally
from Zach Brown. This patch is useful only after adding support for
64-bit block numbers in the filesystem.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
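
To make the size arithmetic concrete, here is a small userspace sketch of
how the tag size depends on the journal format. struct demo_tag mirrors
the three-field tag layout in this patch; demo_tag_bytes and
demo_max_tags are hypothetical names, the feature flag is passed as a
plain int instead of being read from a superblock, and the 12-byte header
size used in the note below is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the on-disk tag layout from this patch. */
struct demo_tag {
	uint32_t blocknr;	/* low 32 bits of the block number */
	uint32_t flags;
	uint32_t blocknr_high;	/* present only in the 64-bit format */
};

/* Bytes one tag occupies on disk for the given journal format. */
static size_t demo_tag_bytes(int has_64bit)
{
	if (has_64bit)
		return sizeof(struct demo_tag);
	return offsetof(struct demo_tag, blocknr_high);
}

/*
 * Upper bound on the number of tags in one descriptor block; the
 * optional 16-byte UUID that may follow a tag is ignored here.
 */
static size_t demo_max_tags(size_t blocksize, size_t header_bytes,
			    int has_64bit)
{
	assert(blocksize > header_bytes);
	return (blocksize - header_bytes) / demo_tag_bytes(has_64bit);
}
```

With a 4096-byte block and an assumed 12-byte header, the sketch gives
510 tags of 8 bytes in the 32-bit format and 340 tags of 12 bytes in the
64-bit format.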

---

 linux-2.6.17-ming/fs/jbd/commit.c     |   16 +++++++++---
 linux-2.6.17-ming/fs/jbd/journal.c    |   11 ++++++++
 linux-2.6.17-ming/fs/jbd/recovery.c   |   42 +++++++++++++++++++++++-----------
 linux-2.6.17-ming/fs/jbd/revoke.c     |   14 ++++++++---
 linux-2.6.17-ming/include/linux/jbd.h |   11 +++++++-
 5 files changed, 72 insertions(+), 22 deletions(-)

diff -puN fs/jbd/commit.c~64bit_jbd_core fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~64bit_jbd_core	2006-06-28 16:46:53.936267153 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c	2006-06-28 16:46:53.953265203 -0700
@@ -160,6 +160,12 @@ static int journal_write_commit_record(j
 	return (ret == -EIO);
 }
 
+static inline void write_split_be64(__be32 *high, __be32 *low, u64 val)
+{
+	*low = cpu_to_be32(val & (u32)~0);
+	*high = cpu_to_be32(val >> 32);
+}
+
 /*
  * journal_commit_transaction
  *
@@ -182,6 +188,7 @@ void journal_commit_transaction(journal_
 	int first_tag = 0;
 	int tag_flag;
 	int i;
+	int tag_bytes = journal_tag_bytes(journal);
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -553,10 +560,11 @@ write_out_data:
 			tag_flag |= JFS_FLAG_SAME_UUID;
 
 		tag = (journal_block_tag_t *) tagp;
-		tag->t_blocknr = cpu_to_be32(jh2bh(jh)->b_blocknr);
+		write_split_be64(&tag->t_blocknr_high, &tag->t_blocknr,
+				jh2bh(jh)->b_blocknr);
 		tag->t_flags = cpu_to_be32(tag_flag);
-		tagp += sizeof(journal_block_tag_t);
-		space_left -= sizeof(journal_block_tag_t);
+		tagp += tag_bytes;
+		space_left -= tag_bytes;
 
 		if (first_tag) {
 			memcpy (tagp, journal->j_uuid, 16);
@@ -570,7 +578,7 @@ write_out_data:
 
 		if (bufs == journal->j_wbufsize ||
 		    commit_transaction->t_buffers == NULL ||
-		    space_left < sizeof(journal_block_tag_t) + 16) {
+		    space_left < tag_bytes + 16) {
 
 			jbd_debug(4, "JBD: Submit %d IOs\n", bufs);
 
diff -puN fs/jbd/journal.c~64bit_jbd_core fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~64bit_jbd_core	2006-06-28 16:46:53.939266809 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c	2006-06-28 16:46:53.956264859 -0700
@@ -1603,6 +1603,17 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
+ * helper functions to deal with 32 or 64bit block numbers.
+ */
+size_t journal_tag_bytes(journal_t *journal)
+{
+	if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
+		return sizeof(journal_block_tag_t);
+	else
+		return offsetof(journal_block_tag_t, t_blocknr_high);
+}
+
+/*
  * Simple support for retrying memory allocations.  Introduced to help to
  * debug different VM deadlock avoidance strategies. 
  */
diff -puN fs/jbd/recovery.c~64bit_jbd_core fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~64bit_jbd_core	2006-06-28 16:46:53.942266465 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c	2006-06-28 16:46:53.957264744 -0700
@@ -178,19 +178,20 @@ static int jread(struct buffer_head **bh
  * Count the number of in-use tags in a journal descriptor block.
  */
 
-static int count_tags(struct buffer_head *bh, int size)
+static int count_tags(journal_t *journal, struct buffer_head *bh)
 {
 	char *			tagp;
 	journal_block_tag_t *	tag;
-	int			nr = 0;
+	int			nr = 0, size = journal->j_blocksize;
+	int 			tag_bytes = journal_tag_bytes(journal);
 
 	tagp = &bh->b_data[sizeof(journal_header_t)];
 
-	while ((tagp - bh->b_data + sizeof(journal_block_tag_t)) <= size) {
+	while ((tagp - bh->b_data + tag_bytes) <= size) {
 		tag = (journal_block_tag_t *) tagp;
 
 		nr++;
-		tagp += sizeof(journal_block_tag_t);
+		tagp += tag_bytes;
 		if (!(tag->t_flags & cpu_to_be32(JFS_FLAG_SAME_UUID)))
 			tagp += 16;
 
@@ -307,6 +308,13 @@ int journal_skip_recovery(journal_t *jou
 	return err;
 }
 
+static inline u64 read_split_be64(__be32 *high, __be32 *low)
+{
+	u64 ret = be32_to_cpu(*low);
+	ret |= (u64)be32_to_cpu(*high) << 32;
+	return ret;
+}
+
 static int do_one_pass(journal_t *journal,
 			struct recovery_info *info, enum passtype pass)
 {
@@ -318,11 +326,12 @@ static int do_one_pass(journal_t *journa
 	struct buffer_head *	bh;
 	unsigned int		sequence;
 	int			blocktype;
+	int 			tag_bytes = journal_tag_bytes(journal);
 
 	/* Precompute the maximum metadata descriptors in a descriptor block */
 	int			MAX_BLOCKS_PER_DESC;
 	MAX_BLOCKS_PER_DESC = ((journal->j_blocksize-sizeof(journal_header_t))
-			       / sizeof(journal_block_tag_t));
+			       / tag_bytes);
 
 	/* 
 	 * First thing is to establish what we expect to find in the log
@@ -412,8 +421,7 @@ static int do_one_pass(journal_t *journa
 			 * in pass REPLAY; otherwise, just skip over the
 			 * blocks it describes. */
 			if (pass != PASS_REPLAY) {
-				next_log_block +=
-					count_tags(bh, journal->j_blocksize);
+				next_log_block += count_tags(journal, bh);
 				wrap(journal, next_log_block);
 				brelse(bh);
 				continue;
@@ -424,7 +432,7 @@ static int do_one_pass(journal_t *journa
 			 * getting done here! */
 
 			tagp = &bh->b_data[sizeof(journal_header_t)];
-			while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
+			while ((tagp - bh->b_data + tag_bytes)
 			       <= journal->j_blocksize) {
 				unsigned long io_block;
 
@@ -446,7 +454,8 @@ static int do_one_pass(journal_t *journa
 					unsigned long blocknr;
 
 					J_ASSERT(obh != NULL);
-					blocknr = be32_to_cpu(tag->t_blocknr);
+					blocknr = read_split_be64(&tag->t_blocknr_high,
+							&tag->t_blocknr);
 
 					/* If the block has been
 					 * revoked, then we're all done
@@ -494,7 +503,7 @@ static int do_one_pass(journal_t *journa
 				}
 
 			skip_write:
-				tagp += sizeof(journal_block_tag_t);
+				tagp += tag_bytes;
 				if (!(flags & JFS_FLAG_SAME_UUID))
 					tagp += 16;
 
@@ -572,17 +581,24 @@ static int scan_revoke_records(journal_t
 {
 	journal_revoke_header_t *header;
 	int offset, max;
+	int record_len = 4;
 
 	header = (journal_revoke_header_t *) bh->b_data;
 	offset = sizeof(journal_revoke_header_t);
 	max = be32_to_cpu(header->r_count);
 
-	while (offset < max) {
+	if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
+		record_len = 8;
+
+	while (offset + record_len <= max) {
 		unsigned long blocknr;
 		int err;
 
-		blocknr = be32_to_cpu(* ((__be32 *) (bh->b_data+offset)));
-		offset += 4;
+		if (record_len == 4)
+			blocknr = be32_to_cpu(* ((__be32 *) (bh->b_data+offset)));
+		else
+			blocknr = be64_to_cpu(* ((__be64 *) (bh->b_data+offset)));
+		offset += record_len;
 		err = journal_set_revoke(journal, blocknr, sequence);
 		if (err)
 			return err;
diff -puN fs/jbd/revoke.c~64bit_jbd_core fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~64bit_jbd_core	2006-06-28 16:46:53.945266121 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c	2006-06-28 16:46:53.959264514 -0700
@@ -584,9 +584,17 @@ static void write_one_revoke_record(jour
 		*descriptorp = descriptor;
 	}
 
-	* ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) = 
-		cpu_to_be32(record->blocknr);
-	offset += 4;
+	if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT)) {
+		* ((__be64 *)(&jh2bh(descriptor)->b_data[offset])) =
+			cpu_to_be64(record->blocknr);
+		offset += 8;
+
+	} else {
+		* ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) =
+			cpu_to_be32(record->blocknr);
+		offset += 4;
+	}
+
 	*offsetp = offset;
 }
 
diff -puN include/linux/jbd.h~64bit_jbd_core include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~64bit_jbd_core	2006-06-28 16:46:53.949265662 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h	2006-06-28 16:46:53.961264285 -0700
@@ -147,12 +147,16 @@ typedef struct journal_header_s
 
 
 /* 
- * The block tag: used to describe a single buffer in the journal 
+ * The block tag: used to describe a single buffer in the journal.
+ * t_blocknr_high is only used if INCOMPAT_64BIT is set, so this
+ * raw struct shouldn't be used for pointer math or sizeof() - use
+ * journal_tag_bytes(journal) instead to compute this.
  */
 typedef struct journal_block_tag_s
 {
 	__be32		t_blocknr;	/* The on-disk block number */
 	__be32		t_flags;	/* See below */
+	__be32		t_blocknr_high; /* most-significant high 32bits. */
 } journal_block_tag_t;
 
 /* 
@@ -232,11 +236,13 @@ typedef struct journal_superblock_s
 	 ((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))
 
 #define JFS_FEATURE_INCOMPAT_REVOKE	0x00000001
+#define JFS_FEATURE_INCOMPAT_64BIT	0x00000002
 
 /* Features known to this kernel version: */
 #define JFS_KNOWN_COMPAT_FEATURES	0
 #define JFS_KNOWN_ROCOMPAT_FEATURES	0
-#define JFS_KNOWN_INCOMPAT_FEATURES	JFS_FEATURE_INCOMPAT_REVOKE
+#define JFS_KNOWN_INCOMPAT_FEATURES	(JFS_FEATURE_INCOMPAT_REVOKE | \
+					 JFS_FEATURE_INCOMPAT_64BIT)
 
 #ifdef __KERNEL__
 
@@ -1050,6 +1056,7 @@ static inline int tid_geq(tid_t x, tid_t
 }
 
 extern int journal_blocks_per_page(struct inode *inode);
+extern size_t journal_tag_bytes(journal_t *journal);
 
 /*
  * Return the minimum number of blocks which must be free in the journal

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (10 preceding siblings ...)
  2006-06-30  0:17 ` [RFC][Update][Patch 7/16]Core 64 bit JBD changes Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  0:18 ` [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors Mingming Cao
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

When writing block numbers into a journal descriptor block, don't write
the top 32 bits of a tag unless we're using a 64-bit journal.  That
avoids any possibility of overflowing off the end of the descriptor
block in the case where the last 32-bit tag only just fits into the
descriptor block.

Also cleans up the tag handling slightly by introducing new macros for
the size of 32- and 64-bit descriptor tags.

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
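
The overflow case can be sketched in userspace as follows. This is an
illustration under assumptions, not the JBD code: a tag is treated as raw
bytes, DEMO_TAG_SIZE32 and DEMO_TAG_SIZE64 stand in for the new
JBD_TAG_SIZE32/JBD_TAG_SIZE64 macros, and demo_put_be32 and
demo_write_tag_block are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TAG_SIZE32	8	/* blocknr + flags */
#define DEMO_TAG_SIZE64	12	/* blocknr + flags + blocknr_high */

/* Store a 32-bit value in big-endian byte order. */
static void demo_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Write the block number of the tag at tagp. Bytes 8..11 are written
 * only for 64-bit tags, so a 32-bit tag that ends exactly at the end
 * of the descriptor block never writes past it.
 */
static void demo_write_tag_block(int tag_bytes, unsigned char *tagp,
				 uint64_t block)
{
	assert(tag_bytes == DEMO_TAG_SIZE32 || tag_bytes == DEMO_TAG_SIZE64);
	demo_put_be32(tagp, (uint32_t)(block & 0xffffffff));
	if (tag_bytes > DEMO_TAG_SIZE32)
		demo_put_be32(tagp + 8, (uint32_t)(block >> 32));
}
```

Writing block 0x1122334455 with the 32-bit tag size stores only the low
word 0x22334455 and leaves the bytes after the tag untouched; with the
64-bit tag size the high word 0x11 is stored as well.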


---

 linux-2.6.17-ming/fs/jbd/commit.c     |   11 ++++++-----
 linux-2.6.17-ming/include/linux/jbd.h |    3 +++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN fs/jbd/commit.c~jbd-avoid-blk-overflow-write-journal-metadata-tag fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~jbd-avoid-blk-overflow-write-journal-metadata-tag	2006-06-28 16:46:58.783710948 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c	2006-06-28 16:46:58.791710030 -0700
@@ -160,10 +160,12 @@ static int journal_write_commit_record(j
 	return (ret == -EIO);
 }
 
-static inline void write_split_be64(__be32 *high, __be32 *low, u64 val)
+static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
+				   sector_t block)
 {
-	*low = cpu_to_be32(val & (u32)~0);
-	*high = cpu_to_be32(val >> 32);
+	tag->t_blocknr = cpu_to_be32(block & (u32)~0);
+	if (tag_bytes > JBD_TAG_SIZE32)
+		tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
 }
 
 /*
@@ -560,8 +562,7 @@ write_out_data:
 			tag_flag |= JFS_FLAG_SAME_UUID;
 
 		tag = (journal_block_tag_t *) tagp;
-		write_split_be64(&tag->t_blocknr_high, &tag->t_blocknr,
-				jh2bh(jh)->b_blocknr);
+		write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
 		tag->t_flags = cpu_to_be32(tag_flag);
 		tagp += tag_bytes;
 		space_left -= tag_bytes;
diff -puN include/linux/jbd.h~jbd-avoid-blk-overflow-write-journal-metadata-tag include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~jbd-avoid-blk-overflow-write-journal-metadata-tag	2006-06-28 16:46:58.787710489 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h	2006-06-28 16:46:58.793709801 -0700
@@ -159,6 +159,9 @@ typedef struct journal_block_tag_s
 	__be32		t_blocknr_high; /* most-significant high 32bits. */
 } journal_block_tag_t;
 
+#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+
 /* 
  * The revoke descriptor: used on disk to describe a series of blocks to
  * be revoked from the log 

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (11 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  0:18 ` [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes() Mingming Cao
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

We must never attempt to read the high 32-bits of a descriptor tag on
a 32-bit journal, even when CONFIG_LBD is set, as we'll end up reading
garbage from the subsequent tag.
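
To illustrate the failure mode, here is a minimal user-space sketch, not the
kernel code. The demo_* names and the byte-array struct are made up for the
example; it only assumes the layout shown in the patch, where 32-bit journals
pack 8-byte tags back to back and 64-bit journals use 12-byte tags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of journal_block_tag_t; all fields are big-endian. */
struct demo_tag {
	uint8_t blocknr[4];      /* low 32 bits of the block number */
	uint8_t flags[4];
	uint8_t blocknr_high[4]; /* present only on 64-bit journals */
};

#define DEMO_TAG_SIZE32 offsetof(struct demo_tag, blocknr_high) /* 8 bytes */
#define DEMO_TAG_SIZE64 sizeof(struct demo_tag)                 /* 12 bytes */

/* Decode a big-endian 32-bit value from a byte buffer. */
static uint32_t demo_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read the block number of the tag at tagp, touching the high word
 * only when the journal's tag size says it exists. */
static uint64_t demo_read_tag_block(size_t tag_bytes, const uint8_t *tagp)
{
	uint64_t block;

	assert(tag_bytes == DEMO_TAG_SIZE32 || tag_bytes == DEMO_TAG_SIZE64);
	block = demo_be32(tagp + offsetof(struct demo_tag, blocknr));
	if (tag_bytes > DEMO_TAG_SIZE32)
		block |= (uint64_t)demo_be32(tagp +
			offsetof(struct demo_tag, blocknr_high)) << 32;
	return block;
}
```

With two 8-byte tags stored back to back, reading the first one as if it were
12 bytes long picks up the low block number of the second tag as its own high
word, which is the garbage described above.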

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>


---

 linux-2.6.17-ming/fs/jbd/recovery.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff -puN fs/jbd/recovery.c~jbd-read-32bit-tag-fix fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~jbd-read-32bit-tag-fix	2006-06-28 16:47:02.555278191 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c	2006-06-28 16:47:02.558277847 -0700
@@ -308,11 +308,12 @@ int journal_skip_recovery(journal_t *jou
 	return err;
 }
 
-static inline u64 read_split_be64(__be32 *high, __be32 *low)
+static inline sector_t read_tag_block(int tag_bytes, journal_block_tag_t *tag)
 {
-	u64 ret = be32_to_cpu(*low);
-	ret |= (u64)be32_to_cpu(*high) << 32;
-	return ret;
+	sector_t block = be32_to_cpu(tag->t_blocknr);
+	if (tag_bytes > JBD_TAG_SIZE32)
+		block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
+	return block;
 }
 
 static int do_one_pass(journal_t *journal,
@@ -454,8 +455,8 @@ static int do_one_pass(journal_t *journa
 					unsigned long blocknr;
 
 					J_ASSERT(obh != NULL);
-					blocknr = read_split_be64(&tag->t_blocknr_high,
-							&tag->t_blocknr);
+					blocknr = read_tag_block(tag_bytes,
+								 tag);
 
 					/* If the block has been
 					 * revoked, then we're all done

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes()
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (12 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  0:18 ` [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes Mingming Cao
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Cleanup journal_tag_bytes() to use the new JBD_TAG_SIZE* macros.
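
For readers following along, a small user-space sketch of what the two macros
evaluate to and how the tag size feeds into descriptor packing. The demo_*
names, the feature-bit value and the "space" parameter are assumptions made
for the example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for journal_block_tag_t. */
struct demo_tag {
	uint32_t blocknr;
	uint32_t flags;
	uint32_t blocknr_high; /* most-significant 32 bits */
};

#define DEMO_TAG_SIZE32 (offsetof(struct demo_tag, blocknr_high))
#define DEMO_TAG_SIZE64 (sizeof(struct demo_tag))

/* Assumed feature bit, chosen only for this illustration. */
#define DEMO_FEATURE_INCOMPAT_64BIT 0x2u

/* Pick the on-disk tag size from the journal's incompat feature mask. */
static size_t demo_tag_bytes(uint32_t incompat_features)
{
	if (incompat_features & DEMO_FEATURE_INCOMPAT_64BIT)
		return DEMO_TAG_SIZE64;
	return DEMO_TAG_SIZE32;
}

/* How many tags fit in "space" bytes of a descriptor block. */
static size_t demo_tags_per_space(size_t space, uint32_t incompat_features)
{
	size_t tag_bytes = demo_tag_bytes(incompat_features);

	assert(tag_bytes > 0);
	return space / tag_bytes;
}
```

Keeping both sizes behind named macros means the writer, the reader and
journal_tag_bytes() all agree on one definition of "32-bit tag".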

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>


---

 linux-2.6.17-ming/fs/jbd/journal.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff -puN fs/jbd/journal.c~jbd-cleanup-journal_tag_bytes fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~jbd-cleanup-journal_tag_bytes	2006-06-28 16:47:05.112984715 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c	2006-06-28 16:47:05.117984141 -0700
@@ -1608,9 +1608,9 @@ int journal_blocks_per_page(struct inode
 size_t journal_tag_bytes(journal_t *journal)
 {
 	if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
-		return sizeof(journal_block_tag_t);
+		return JBD_TAG_SIZE64;
 	else
-		return offsetof(journal_block_tag_t, t_blocknr_high);
+		return JBD_TAG_SIZE32;
 }
 
 /*

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (13 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes() Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel


Fix the types of the in-kernel block variables in the JBD layer: convert
them to sector_t so that they can hold block numbers wider than 32 bits.
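
As a rough user-space illustration of why the wider type matters (all demo_*
names are made up; uint32_t stands in for unsigned long on a 32-bit kernel and
uint64_t for sector_t with CONFIG_LBD set):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t demo_ulong32;  /* models unsigned long on a 32-bit kernel */
typedef uint64_t demo_sector_t; /* models sector_t with CONFIG_LBD=y */

/* What a narrow block variable does: silently drop the high bits. */
static demo_ulong32 demo_store_narrow(demo_sector_t block)
{
	return (demo_ulong32)block;
}

/* Returns 1 if the block number survives the narrow variable unchanged. */
static int demo_fits_narrow(demo_sector_t block)
{
	assert(sizeof(demo_sector_t) > sizeof(demo_ulong32));
	return (demo_sector_t)demo_store_narrow(block) == block;
}

/* Printing a sector_t-like value portably: cast to unsigned long long
 * so that %llu matches regardless of the underlying width. */
static int demo_format_block(char *buf, size_t len, demo_sector_t block)
{
	return snprintf(buf, len, "block %llu", (unsigned long long)block);
}
```

The same cast applies to any format string that moves from %lu to %llu,
because sector_t may still be a plain unsigned long when CONFIG_LBD is off.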

Signed-Off-By: Mingming Cao <cmm@us.ibm.com>

---

 linux-2.6.17-ming/fs/jbd/commit.c          |    2 +-
 linux-2.6.17-ming/fs/jbd/journal.c         |   16 ++++++++--------
 linux-2.6.17-ming/fs/jbd/recovery.c        |    8 ++++----
 linux-2.6.17-ming/fs/jbd/revoke.c          |   24 +++++++++++++-----------
 linux-2.6.17-ming/include/linux/ext3_jbd.h |    2 +-
 linux-2.6.17-ming/include/linux/jbd.h      |   17 ++++++++---------
 6 files changed, 35 insertions(+), 34 deletions(-)

diff -puN fs/jbd/commit.c~sector_t-jbd fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~sector_t-jbd	2006-06-28 16:47:07.568702941 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c	2006-06-28 16:47:07.590700417 -0700
@@ -182,7 +182,7 @@ void journal_commit_transaction(journal_
 	int bufs;
 	int flags;
 	int err;
-	unsigned long blocknr;
+	sector_t blocknr;
 	char *tagp = NULL;
 	journal_header_t *header;
 	journal_block_tag_t *tag = NULL;
diff -puN fs/jbd/journal.c~sector_t-jbd fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~sector_t-jbd	2006-06-28 16:47:07.572702482 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c	2006-06-28 16:47:07.593700073 -0700
@@ -270,7 +270,7 @@ static void journal_kill_thread(journal_
 int journal_write_metadata_buffer(transaction_t *transaction,
 				  struct journal_head  *jh_in,
 				  struct journal_head **jh_out,
-				  int blocknr)
+				  sector_t blocknr)
 {
 	int need_copy_out = 0;
 	int done_copy_out = 0;
@@ -554,7 +554,7 @@ int log_wait_commit(journal_t *journal, 
  * Log buffer allocation routines:
  */
 
-int journal_next_log_block(journal_t *journal, unsigned long *retp)
+int journal_next_log_block(journal_t *journal, sector_t *retp)
 {
 	unsigned long blocknr;
 
@@ -578,10 +578,10 @@ int journal_next_log_block(journal_t *jo
  * ready.
  */
 int journal_bmap(journal_t *journal, unsigned long blocknr, 
-		 unsigned long *retp)
+		 sector_t *retp)
 {
 	int err = 0;
-	unsigned long ret;
+	sector_t ret;
 
 	if (journal->j_inode) {
 		ret = bmap(journal->j_inode, blocknr);
@@ -617,7 +617,7 @@ int journal_bmap(journal_t *journal, uns
 struct journal_head *journal_get_descriptor_buffer(journal_t *journal)
 {
 	struct buffer_head *bh;
-	unsigned long blocknr;
+	sector_t blocknr;
 	int err;
 
 	err = journal_next_log_block(journal, &blocknr);
@@ -705,7 +705,7 @@ fail:
  */
 journal_t * journal_init_dev(struct block_device *bdev,
 			struct block_device *fs_dev,
-			int start, int len, int blocksize)
+			sector_t start, int len, int blocksize)
 {
 	journal_t *journal = journal_init_common();
 	struct buffer_head *bh;
@@ -753,7 +753,7 @@ journal_t * journal_init_inode (struct i
 	journal_t *journal = journal_init_common();
 	int err;
 	int n;
-	unsigned long blocknr;
+	sector_t blocknr;
 
 	if (!journal)
 		return NULL;
@@ -853,7 +853,7 @@ static int journal_reset(journal_t *jour
  **/
 int journal_create(journal_t *journal)
 {
-	unsigned long blocknr;
+	sector_t blocknr;
 	struct buffer_head *bh;
 	journal_superblock_t *sb;
 	int i, err;
diff -puN fs/jbd/recovery.c~sector_t-jbd fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~sector_t-jbd	2006-06-28 16:47:07.575702138 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c	2006-06-28 16:47:07.595699844 -0700
@@ -70,7 +70,7 @@ static int do_readahead(journal_t *journ
 {
 	int err;
 	unsigned int max, nbufs, next;
-	unsigned long blocknr;
+	sector_t blocknr;
 	struct buffer_head *bh;
 
 	struct buffer_head * bufs[MAXBUF];
@@ -132,7 +132,7 @@ static int jread(struct buffer_head **bh
 		 unsigned int offset)
 {
 	int err;
-	unsigned long blocknr;
+	sector_t blocknr;
 	struct buffer_head *bh;
 
 	*bhp = NULL;
@@ -452,7 +452,7 @@ static int do_one_pass(journal_t *journa
 						"block %ld in log\n",
 						err, io_block);
 				} else {
-					unsigned long blocknr;
+					sector_t blocknr;
 
 					J_ASSERT(obh != NULL);
 					blocknr = read_tag_block(tag_bytes,
@@ -592,7 +592,7 @@ static int scan_revoke_records(journal_t
 		record_len = 8;
 
 	while (offset + record_len < max) {
-		unsigned long blocknr;
+		sector_t blocknr;
 		int err;
 
 		if (record_len == 4)
diff -puN fs/jbd/revoke.c~sector_t-jbd fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~sector_t-jbd	2006-06-28 16:47:07.578701794 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c	2006-06-28 16:47:07.596699729 -0700
@@ -81,7 +81,7 @@ struct jbd_revoke_record_s 
 {
 	struct list_head  hash;
 	tid_t		  sequence;	/* Used for recovery only */
-	unsigned long	  blocknr;
+	sector_t	  blocknr;
 };
 
 
@@ -106,17 +106,18 @@ static void flush_descriptor(journal_t *
 /* Utility functions to maintain the revoke table */
 
 /* Borrowed from buffer.c: this is a tried and tested block hash function */
-static inline int hash(journal_t *journal, unsigned long block)
+static inline int hash(journal_t *journal, sector_t block)
 {
 	struct jbd_revoke_table_s *table = journal->j_revoke;
 	int hash_shift = table->hash_shift;
+	int hash = (int)block ^ (int)(block >> 32);
 
-	return ((block << (hash_shift - 6)) ^
-		(block >> 13) ^
-		(block << (hash_shift - 12))) & (table->hash_size - 1);
+	return ((hash << (hash_shift - 6)) ^
+		(hash >> 13) ^
+		(hash << (hash_shift - 12))) & (table->hash_size - 1);
 }
 
-static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
+static int insert_revoke_hash(journal_t *journal, sector_t blocknr,
 			      tid_t seq)
 {
 	struct list_head *hash_list;
@@ -146,7 +147,7 @@ oom:
 /* Find a revoke record in the journal's hash table. */
 
 static struct jbd_revoke_record_s *find_revoke_record(journal_t *journal,
-						      unsigned long blocknr)
+						      sector_t blocknr)
 {
 	struct list_head *hash_list;
 	struct jbd_revoke_record_s *record;
@@ -325,7 +326,7 @@ void journal_destroy_revoke(journal_t *j
  * by one.
  */
 
-int journal_revoke(handle_t *handle, unsigned long blocknr, 
+int journal_revoke(handle_t *handle, sector_t blocknr,
 		   struct buffer_head *bh_in)
 {
 	struct buffer_head *bh = NULL;
@@ -394,7 +395,8 @@ int journal_revoke(handle_t *handle, uns
 		}
 	}
 
-	jbd_debug(2, "insert revoke for block %lu, bh_in=%p\n", blocknr, bh_in);
+	jbd_debug(2, "insert revoke for block %llu, bh_in=%p\n",
+		(unsigned long long)blocknr, bh_in);
 	err = insert_revoke_hash(journal, blocknr,
 				handle->h_transaction->t_tid);
 	BUFFER_TRACE(bh_in, "exit");
@@ -649,7 +651,7 @@ static void flush_descriptor(journal_t *
  */
 
 int journal_set_revoke(journal_t *journal, 
-		       unsigned long blocknr, 
+		       sector_t blocknr,
 		       tid_t sequence)
 {
 	struct jbd_revoke_record_s *record;
@@ -673,7 +675,7 @@ int journal_set_revoke(journal_t *journa
  */
 
 int journal_test_revoke(journal_t *journal, 
-			unsigned long blocknr,
+			sector_t blocknr,
 			tid_t sequence)
 {
 	struct jbd_revoke_record_s *record;
diff -puN include/linux/ext3_jbd.h~sector_t-jbd include/linux/ext3_jbd.h
--- linux-2.6.17/include/linux/ext3_jbd.h~sector_t-jbd	2006-06-28 16:47:07.581701450 -0700
+++ linux-2.6.17-ming/include/linux/ext3_jbd.h	2006-06-28 16:47:07.597699614 -0700
@@ -154,7 +154,7 @@ __ext3_journal_forget(const char *where,
 
 static inline int
 __ext3_journal_revoke(const char *where, handle_t *handle,
-		      unsigned long blocknr, struct buffer_head *bh)
+		      ext3_fsblk_t blocknr, struct buffer_head *bh)
 {
 	int err = journal_revoke(handle, blocknr, bh);
 	if (err)
diff -puN include/linux/jbd.h~sector_t-jbd include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~sector_t-jbd	2006-06-28 16:47:07.585700991 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h	2006-06-28 16:47:07.600699270 -0700
@@ -738,7 +738,7 @@ struct journal_s
 	 */
 	struct block_device	*j_dev;
 	int			j_blocksize;
-	unsigned int		j_blk_offset;
+	sector_t		j_blk_offset;
 
 	/*
 	 * Device which holds the client fs.  For internal journal this will be
@@ -857,7 +857,7 @@ extern void __journal_clean_data_list(tr
 
 /* Log buffer allocation */
 extern struct journal_head * journal_get_descriptor_buffer(journal_t *);
-int journal_next_log_block(journal_t *, unsigned long *);
+int journal_next_log_block(journal_t *, sector_t *);
 
 /* Commit management */
 extern void journal_commit_transaction(journal_t *);
@@ -872,7 +872,7 @@ extern int 
 journal_write_metadata_buffer(transaction_t	  *transaction,
 			      struct journal_head  *jh_in,
 			      struct journal_head **jh_out,
-			      int		   blocknr);
+			      sector_t   blocknr);
 
 /* Transaction locking */
 extern void		__wait_on_journal (journal_t *);
@@ -920,7 +920,7 @@ extern void	 journal_unlock_updates (jou
 
 extern journal_t * journal_init_dev(struct block_device *bdev,
 				struct block_device *fs_dev,
-				int start, int len, int bsize);
+				sector_t start, int len, int bsize);
 extern journal_t * journal_init_inode (struct inode *);
 extern int	   journal_update_format (journal_t *);
 extern int	   journal_check_used_features 
@@ -941,7 +941,7 @@ extern void	   journal_abort      (journ
 extern int	   journal_errno      (journal_t *);
 extern void	   journal_ack_err    (journal_t *);
 extern int	   journal_clear_err  (journal_t *);
-extern int	   journal_bmap(journal_t *, unsigned long, unsigned long *);
+extern int	   journal_bmap(journal_t *, unsigned long, sector_t *);
 extern int	   journal_force_commit(journal_t *);
 
 /*
@@ -974,14 +974,13 @@ extern void	   journal_destroy_revoke_ca
 extern int	   journal_init_revoke_caches(void);
 
 extern void	   journal_destroy_revoke(journal_t *);
-extern int	   journal_revoke (handle_t *,
-				unsigned long, struct buffer_head *);
+extern int	   journal_revoke (handle_t *, sector_t, struct buffer_head *);
 extern int	   journal_cancel_revoke(handle_t *, struct journal_head *);
 extern void	   journal_write_revoke_records(journal_t *, transaction_t *);
 
 /* Recovery revoke support */
-extern int	journal_set_revoke(journal_t *, unsigned long, tid_t);
-extern int	journal_test_revoke(journal_t *, unsigned long, tid_t);
+extern int	journal_set_revoke(journal_t *, sector_t, tid_t);
+extern int	journal_test_revoke(journal_t *, sector_t, tid_t);
 extern void	journal_clear_revoke(journal_t *);
 extern void	journal_brelse_array(struct buffer_head *b[], int n);
 extern void	journal_switch_revoke_table(journal_t *journal);

_





^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (14 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  3:15   ` H. Peter Anvin
  2006-06-30  0:18 ` [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support Mingming Cao
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

"val >> 32" is undefined if val is a 32-bit value, so this code is
broken if CONFIG_LBD is not set.  Make it safe for that case.
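
A minimal user-space sketch of the idiom, with made-up demo_* names. In C,
shifting by an amount greater than or equal to the width of the (promoted)
operand is undefined, so a plain ">> 32" is only safe on a 64-bit value;
splitting it into ">> 31" followed by ">> 1" is well defined for both widths.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit block type (CONFIG_LBD off): the high half is always zero,
 * and neither shift reaches the operand width. */
static uint32_t demo_high32_from32(uint32_t block)
{
	return (block >> 31) >> 1;
}

/* 64-bit block type (CONFIG_LBD on): the same expression yields the
 * upper 32 bits. */
static uint32_t demo_high32_from64(uint64_t block)
{
	return (uint32_t)((block >> 31) >> 1);
}

/* Fold a block number into 32 bits the way the revoke hash does,
 * XORing the low and high halves. */
static uint32_t demo_fold(uint64_t block)
{
	uint32_t folded = (uint32_t)block ^ demo_high32_from64(block);

	assert(block > UINT32_MAX || folded == (uint32_t)block);
	return folded;
}
```

For block numbers that fit in 32 bits the fold leaves the value unchanged, so
the hash distribution on existing filesystems should not be affected.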

Signed-off-by: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>


---

 linux-2.6.17-ming/fs/jbd/revoke.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix	2006-06-28 16:47:09.695458913 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c	2006-06-28 16:47:09.699458454 -0700
@@ -110,7 +110,7 @@ static inline int hash(journal_t *journa
 {
 	struct jbd_revoke_table_s *table = journal->j_revoke;
 	int hash_shift = table->hash_shift;
-	int hash = (int)block ^ (int)(block >> 32);
+	int hash = (int)block ^ (int)((block >> 31) >> 1);
 
 	return ((hash << (hash_shift - 6)) ^
 		(hash >> 13) ^

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (15 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
@ 2006-06-30  0:18 ` Mingming Cao
  2006-06-30  0:19 ` [RFC][Update][Patch 14/16] 48bit super block (metadata) changes Mingming Cao
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

As we are planning to support 48-bit block numbers for ext3, we also
need to support 48-bit block numbers for extended attributes.
In the short term, we can do this by reusing the on-disk 16-bit
padding field (linux2.i_pad1, currently used only by "hurd") as the
high-order bits of the xattr block number. This patch does that.
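
A minimal user-space sketch of the split, assuming a 32-bit low field plus a
16-bit high field as in the patch. The demo_* names are hypothetical and the
fields are kept in CPU byte order for brevity, whereas the on-disk fields are
little-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk fields of the raw inode. */
struct demo_raw_inode {
	uint32_t file_acl;      /* low 32 bits of the xattr block number */
	uint16_t file_acl_high; /* high 16 bits, stored in the old padding */
};

/* Split a 48-bit block number across the two fields. */
static void demo_store_file_acl(struct demo_raw_inode *raw, uint64_t acl)
{
	assert(acl < ((uint64_t)1 << 48));
	raw->file_acl = (uint32_t)acl;
	raw->file_acl_high = (uint16_t)(acl >> 32);
}

/* Reassemble the 48-bit block number from the two fields. */
static uint64_t demo_load_file_acl(const struct demo_raw_inode *raw)
{
	return (uint64_t)raw->file_acl |
	       ((uint64_t)raw->file_acl_high << 32);
}
```

Block numbers below 2^32 leave the high field at zero, which matches what
existing inodes already have in the padding.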

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>


---

 linux-2.6.17-ming/fs/ext3/inode.c         |    8 ++++++++
 linux-2.6.17-ming/include/linux/ext3_fs.h |    6 ++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff -puN fs/ext3/inode.c~ext3_48bit_i_file_acl fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~ext3_48bit_i_file_acl	2006-06-28 16:47:11.921203527 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c	2006-06-28 16:47:11.932202265 -0700
@@ -2641,6 +2641,10 @@ void ext3_read_inode(struct inode * inod
 	ei->i_frag_size = raw_inode->i_fsize;
 #endif
 	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
+	if ((sizeof(sector_t) > 4) &&
+	    (EXT3_SB(inode->i_sb)->s_es->s_creator_os != cpu_to_le32(EXT3_OS_HURD)))
+		ei->i_file_acl |=
+			((__u64)le16_to_cpu(raw_inode->i_file_acl_high)) << 32;
 	if (!S_ISREG(inode->i_mode)) {
 		ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
 	} else {
@@ -2774,6 +2778,10 @@ static int ext3_do_update_inode(handle_t
 	raw_inode->i_frag = ei->i_frag_no;
 	raw_inode->i_fsize = ei->i_frag_size;
 #endif
+	if ((sizeof(sector_t) > 4) &&
+	    (EXT3_SB(inode->i_sb)->s_es->s_creator_os != cpu_to_le32(EXT3_OS_HURD)))
+		raw_inode->i_file_acl_high =
+			cpu_to_le16((__u64)ei->i_file_acl >> 32);
 	raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
 	if (!S_ISREG(inode->i_mode)) {
 		raw_inode->i_dir_acl = cpu_to_le32(ei->i_dir_acl);
diff -puN include/linux/ext3_fs.h~ext3_48bit_i_file_acl include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3_48bit_i_file_acl	2006-06-28 16:47:11.925203068 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 16:47:11.934202036 -0700
@@ -285,7 +285,7 @@ struct ext3_inode {
 		struct {
 			__u8	l_i_frag;	/* Fragment number */
 			__u8	l_i_fsize;	/* Fragment size */
-			__u16	i_pad1;
+			__le16	l_i_file_acl_high;
 			__le16	l_i_uid_high;	/* these 2 fields    */
 			__le16	l_i_gid_high;	/* were reserved2[0] */
 			__u32	l_i_reserved2;
@@ -301,7 +301,7 @@ struct ext3_inode {
 		struct {
 			__u8	m_i_frag;	/* Fragment number */
 			__u8	m_i_fsize;	/* Fragment size */
-			__u16	m_pad1;
+			__le16	m_i_file_acl_high;
 			__u32	m_i_reserved2[2];
 		} masix2;
 	} osd2;				/* OS dependent 2 */
@@ -315,6 +315,7 @@ struct ext3_inode {
 #define i_reserved1	osd1.linux1.l_i_reserved1
 #define i_frag		osd2.linux2.l_i_frag
 #define i_fsize		osd2.linux2.l_i_fsize
+#define i_file_acl_high	osd2.linux2.l_i_file_acl_high
 #define i_uid_low	i_uid
 #define i_gid_low	i_gid
 #define i_uid_high	osd2.linux2.l_i_uid_high
@@ -335,6 +336,7 @@ struct ext3_inode {
 #define i_reserved1	osd1.masix1.m_i_reserved1
 #define i_frag		osd2.masix2.m_i_frag
 #define i_fsize		osd2.masix2.m_i_fsize
+#define i_file_acl_high	osd2.masix2.m_i_file_acl_high
 #define i_reserved2	osd2.masix2.m_i_reserved2
 
 #endif /* defined(__KERNEL__) || defined(__linux__) */

_




^ permalink raw reply	[flat|nested] 296+ messages in thread

* [RFC][Update][Patch 14/16] 48bit super block (metadata) changes
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (16 preceding siblings ...)
  2006-06-30  0:18 ` [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support Mingming Cao
@ 2006-06-30  0:19 ` Mingming Cao
  2006-06-30  0:19 ` [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature Mingming Cao
  2006-06-30  0:19 ` [RFC][Update][Patch 16/16]Update ext3 superblock definition Mingming Cao
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

In-kernel and on-disk superblock changes to support block numbers wider
than 32 bits.

Signed-Off-By: Laurent Vivier <Laurent.Vivier@bull.net>


---

 linux-2.6.17-ming/fs/ext3/balloc.c        |   52 ++++++++++++------
 linux-2.6.17-ming/fs/ext3/ialloc.c        |   10 ++-
 linux-2.6.17-ming/fs/ext3/inode.c         |    9 ++-
 linux-2.6.17-ming/fs/ext3/resize.c        |   25 +++++----
 linux-2.6.17-ming/fs/ext3/super.c         |   50 ++++++++++--------
 linux-2.6.17-ming/include/linux/ext3_fs.h |   83 +++++++++++++++++++++++++++++-
 6 files changed, 169 insertions(+), 60 deletions(-)

diff -puN fs/ext3/balloc.c~64bit-metadata fs/ext3/balloc.c
--- linux-2.6.17/fs/ext3/balloc.c~64bit-metadata	2006-06-28 16:47:14.234938045 -0700
+++ linux-2.6.17-ming/fs/ext3/balloc.c	2006-06-28 16:47:14.257935406 -0700
@@ -88,12 +88,16 @@ read_block_bitmap(struct super_block *sb
 	desc = ext3_get_group_desc (sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-	bh = sb_bread(sb, le32_to_cpu(desc->bg_block_bitmap));
+	bh = sb_bread(sb,
+		      EXT3_BLOCK_BITMAP(desc,
+			      ext3_group_first_block_no(sb, block_group)));
 	if (!bh)
 		ext3_error (sb, "read_block_bitmap",
 			    "Cannot read block bitmap - "
-			    "block_group = %d, block_bitmap = %u",
-			    block_group, le32_to_cpu(desc->bg_block_bitmap));
+			    "block_group = %d, block_bitmap = "E3FSBLK,
+			    block_group,
+			    EXT3_BLOCK_BITMAP(desc,
+			      ext3_group_first_block_no(sb, block_group)));
 error_out:
 	return bh;
 }
@@ -328,7 +332,7 @@ void ext3_free_blocks_sb(handle_t *handl
 	es = sbi->s_es;
 	if (block < le32_to_cpu(es->s_first_data_block) ||
 	    block + count < block ||
-	    block + count > le32_to_cpu(es->s_blocks_count)) {
+	    block + count > EXT3_BLOCKS_COUNT(es)) {
 		ext3_error (sb, "ext3_free_blocks",
 			    "Freeing blocks not in datazone - "
 			    "block = "E3FSBLK", count = %lu", block, count);
@@ -356,11 +360,19 @@ do_more:
 	if (!desc)
 		goto error_return;
 
-	if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
-	    in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
-	    in_range (block, le32_to_cpu(desc->bg_inode_table),
+	if (in_range (EXT3_BLOCK_BITMAP(desc,
+				ext3_group_first_block_no(sb, block_group)),
+		      block, count) ||
+	    in_range (EXT3_INODE_BITMAP(desc,
+			    	ext3_group_first_block_no(sb, block_group)),
+		      block, count) ||
+	    in_range (block,
+		      EXT3_INODE_TABLE(desc,
+			      	ext3_group_first_block_no(sb, block_group)),
 		      sbi->s_itb_per_group) ||
-	    in_range (block + count - 1, le32_to_cpu(desc->bg_inode_table),
+	    in_range (block + count - 1,
+		      EXT3_INODE_TABLE(desc,
+			      	ext3_group_first_block_no(sb, block_group)),
 		      sbi->s_itb_per_group))
 		ext3_error (sb, "ext3_free_blocks",
 			    "Freeing blocks in system zones - "
@@ -1163,7 +1175,7 @@ static int ext3_has_free_blocks(struct e
 	ext3_fsblk_t free_blocks, root_blocks;
 
 	free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
-	root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
+	root_blocks = EXT3_R_BLOCKS_COUNT(sbi->s_es);
 	if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
 		sbi->s_resuid != current->fsuid &&
 		(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
@@ -1262,7 +1274,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
 	 * First, test whether the goal block is free.
 	 */
 	if (goal < le32_to_cpu(es->s_first_data_block) ||
-	    goal >= le32_to_cpu(es->s_blocks_count))
+	    goal >= EXT3_BLOCKS_COUNT(es))
 		goal = le32_to_cpu(es->s_first_data_block);
 	ext3_get_group_no_and_offset(sb, goal, &group_no, &grp_target_blk);
 	gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
@@ -1361,11 +1373,15 @@ allocated:
 
 	ret_block = grp_alloc_blk + ext3_group_first_block_no(sb, group_no);
 
-	if (in_range(le32_to_cpu(gdp->bg_block_bitmap), ret_block, num) ||
-	    in_range(le32_to_cpu(gdp->bg_inode_bitmap), ret_block, num) ||
-	    in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
+	if (in_range(EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, group_no)),
+				ret_block, num) ||
+	    in_range(EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, group_no)),
+		    		ret_block, num) ||
+	    in_range(ret_block, EXT3_INODE_TABLE(gdp,
+			    	ext3_group_first_block_no(sb, group_no)),
 		      EXT3_SB(sb)->s_itb_per_group) ||
-	    in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
+	    in_range(ret_block + num - 1, EXT3_INODE_TABLE(gdp,
+			    ext3_group_first_block_no(sb, group_no)),
 		      EXT3_SB(sb)->s_itb_per_group))
 		ext3_error(sb, "ext3_new_block",
 			    "Allocating block in system zone - "
@@ -1404,11 +1420,11 @@ allocated:
 	jbd_unlock_bh_state(bitmap_bh);
 #endif
 
-	if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
+	if (ret_block + num - 1 >= EXT3_BLOCKS_COUNT(es)) {
 		ext3_error(sb, "ext3_new_block",
-			    "block("E3FSBLK") >= blocks count(%d) - "
+			    "block("E3FSBLK") >= blocks count("E3FSBLK") - "
 			    "block_group = %lu, es == %p ", ret_block,
-			le32_to_cpu(es->s_blocks_count), group_no, es);
+			EXT3_BLOCKS_COUNT(es), group_no, es);
 		goto out;
 	}
 
@@ -1501,7 +1517,7 @@ ext3_fsblk_t ext3_count_free_blocks(stru
 	brelse(bitmap_bh);
 	printk("ext3_count_free_blocks: stored = "E3FSBLK
 		", computed = "E3FSBLK", "E3FSBLK"\n",
-	       le32_to_cpu(es->s_free_blocks_count),
+	       EXT3_FREE_BLOCKS_COUNT(es),
 		desc_count, bitmap_count);
 	return bitmap_count;
 #else
diff -puN fs/ext3/ialloc.c~64bit-metadata fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~64bit-metadata	2006-06-28 16:47:14.237937700 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c	2006-06-28 16:47:14.259935176 -0700
@@ -60,12 +60,14 @@ read_inode_bitmap(struct super_block * s
 	if (!desc)
 		goto error_out;
 
-	bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
+	bh = sb_bread(sb, EXT3_INODE_BITMAP(desc,
+			      ext3_group_first_block_no(sb, block_group)));
 	if (!bh)
 		ext3_error(sb, "read_inode_bitmap",
 			    "Cannot read inode bitmap - "
-			    "block_group = %lu, inode_bitmap = %u",
-			    block_group, le32_to_cpu(desc->bg_inode_bitmap));
+			    "block_group = %lu, inode_bitmap = %llu",
+			    block_group, EXT3_INODE_BITMAP(desc,
+			      ext3_group_first_block_no(sb, block_group)));
 error_out:
 	return bh;
 }
@@ -304,7 +306,7 @@ static int find_group_orlov(struct super
 		goto fallback;
 	}
 
-	blocks_per_dir = le32_to_cpu(es->s_blocks_count) - freeb;
+	blocks_per_dir = EXT3_BLOCKS_COUNT(es) - freeb;
 	sector_div(blocks_per_dir, ndirs);
 
 	max_dirs = ndirs / ngroups + inodes_per_group / 16;
diff -puN fs/ext3/inode.c~64bit-metadata fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~64bit-metadata	2006-06-28 16:47:14.241937242 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c	2006-06-28 16:47:14.263934718 -0700
@@ -2433,8 +2433,9 @@ static ext3_fsblk_t ext3_get_inode_block
 	 */
 	offset = ((ino - 1) % EXT3_INODES_PER_GROUP(sb)) *
 		EXT3_INODE_SIZE(sb);
-	block = le32_to_cpu(gdp[desc].bg_inode_table) +
-		(offset >> EXT3_BLOCK_SIZE_BITS(sb));
+	block = EXT3_INODE_TABLE((gdp+desc),
+			ext3_group_first_block_no(sb, block_group)) +
+			(offset >> EXT3_BLOCK_SIZE_BITS(sb));
 
 	iloc->block_group = block_group;
 	iloc->offset = offset & (EXT3_BLOCK_SIZE(sb) - 1);
@@ -2501,7 +2502,9 @@ static int __ext3_get_inode_loc(struct i
 				goto make_io;
 
 			bitmap_bh = sb_getblk(inode->i_sb,
-					le32_to_cpu(desc->bg_inode_bitmap));
+				EXT3_INODE_BITMAP(desc,
+				     ext3_group_first_block_no(inode->i_sb,
+						     block_group)));
 			if (!bitmap_bh)
 				goto make_io;
 
diff -puN fs/ext3/resize.c~64bit-metadata fs/ext3/resize.c
--- linux-2.6.17/fs/ext3/resize.c~64bit-metadata	2006-06-28 16:47:14.245936783 -0700
+++ linux-2.6.17-ming/fs/ext3/resize.c	2006-06-28 16:47:14.266934373 -0700
@@ -27,7 +27,7 @@ static int verify_group_input(struct sup
 {
 	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	struct ext3_super_block *es = sbi->s_es;
-	ext3_fsblk_t start = le32_to_cpu(es->s_blocks_count);
+	ext3_fsblk_t start = EXT3_BLOCKS_COUNT(es);
 	ext3_fsblk_t end = start + input->blocks_count;
 	unsigned group = input->group;
 	ext3_fsblk_t itend = input->inode_table + sbi->s_itb_per_group;
@@ -817,9 +817,12 @@ int ext3_group_add(struct super_block *s
 	/* Update group descriptor block for new group */
 	gdp = (struct ext3_group_desc *)primary->b_data + gdb_off;
 
-	gdp->bg_block_bitmap = cpu_to_le32(input->block_bitmap);
-	gdp->bg_inode_bitmap = cpu_to_le32(input->inode_bitmap);
-	gdp->bg_inode_table = cpu_to_le32(input->inode_table);
+	EXT3_BLOCK_BITMAP_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+				   input->block_bitmap); /* LV FIXME */
+	EXT3_INODE_BITMAP_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+			           input->inode_bitmap); /* LV FIXME */
+	EXT3_INODE_TABLE_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+			          input->inode_table); /* LV FIXME */
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT3_INODES_PER_GROUP(sb));
 
@@ -833,7 +836,7 @@ int ext3_group_add(struct super_block *s
 	 * blocks/inodes before the group is live won't actually let us
 	 * allocate the new space yet.
 	 */
-	es->s_blocks_count = cpu_to_le32(le32_to_cpu(es->s_blocks_count) +
+	EXT3_BLOCKS_COUNT_SET(es, EXT3_BLOCKS_COUNT(es) +
 		input->blocks_count);
 	es->s_inodes_count = cpu_to_le32(le32_to_cpu(es->s_inodes_count) +
 		EXT3_INODES_PER_GROUP(sb));
@@ -869,7 +872,7 @@ int ext3_group_add(struct super_block *s
 
 	/* Update the reserved block counts only once the new group is
 	 * active. */
-	es->s_r_blocks_count = cpu_to_le32(le32_to_cpu(es->s_r_blocks_count) +
+	EXT3_R_BLOCKS_COUNT_SET(es, EXT3_R_BLOCKS_COUNT(es) +
 		input->reserved_blocks);
 
 	/* Update the free space counts */
@@ -920,7 +923,7 @@ int ext3_group_extend(struct super_block
 	/* We don't need to worry about locking wrt other resizers just
 	 * yet: we're going to revalidate es->s_blocks_count after
 	 * taking lock_super() below. */
-	o_blocks_count = le32_to_cpu(es->s_blocks_count);
+	o_blocks_count = EXT3_BLOCKS_COUNT(es);
 	o_groups_count = EXT3_SB(sb)->s_groups_count;
 
 	if (test_opt(sb, DEBUG))
@@ -986,7 +989,7 @@ int ext3_group_extend(struct super_block
 	}
 
 	lock_super(sb);
-	if (o_blocks_count != le32_to_cpu(es->s_blocks_count)) {
+	if (o_blocks_count != EXT3_BLOCKS_COUNT(es)) {
 		ext3_warning(sb, __FUNCTION__,
 			     "multiple resizers run on filesystem!");
 		unlock_super(sb);
@@ -1002,7 +1005,7 @@ int ext3_group_extend(struct super_block
 		ext3_journal_stop(handle);
 		goto exit_put;
 	}
-	es->s_blocks_count = cpu_to_le32(o_blocks_count + add);
+	EXT3_BLOCKS_COUNT_SET(es, o_blocks_count + add);
 	ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
 	sb->s_dirt = 1;
 	unlock_super(sb);
@@ -1014,8 +1017,8 @@ int ext3_group_extend(struct super_block
 	if ((err = ext3_journal_stop(handle)))
 		goto exit_put;
 	if (test_opt(sb, DEBUG))
-		printk(KERN_DEBUG "EXT3-fs: extended group to %u blocks\n",
-		       le32_to_cpu(es->s_blocks_count));
+		printk(KERN_DEBUG "EXT3-fs: extended group to %llu blocks\n",
+		       EXT3_BLOCKS_COUNT(es));
 	update_backups(sb, EXT3_SB(sb)->s_sbh->b_blocknr, (char *)es,
 		       sizeof(struct ext3_super_block));
 exit_put:
diff -puN fs/ext3/super.c~64bit-metadata fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~64bit-metadata	2006-06-28 16:47:14.248936438 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c	2006-06-28 16:47:14.270933914 -0700
@@ -1151,44 +1151,48 @@ static int ext3_check_descriptors (struc
 		if ((i % EXT3_DESC_PER_BLOCK(sb)) == 0)
 			gdp = (struct ext3_group_desc *)
 					sbi->s_group_desc[desc_block++]->b_data;
-		if (le32_to_cpu(gdp->bg_block_bitmap) < block ||
-		    le32_to_cpu(gdp->bg_block_bitmap) >=
+		if (EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)) <
+				block ||
+		    	EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)) >=
 				block + EXT3_BLOCKS_PER_GROUP(sb))
 		{
 			ext3_error (sb, "ext3_check_descriptors",
 				    "Block bitmap for group %d"
 				    " not in group (block %lu)!",
 				    i, (unsigned long)
-					le32_to_cpu(gdp->bg_block_bitmap));
+					EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)));
 			return 0;
 		}
-		if (le32_to_cpu(gdp->bg_inode_bitmap) < block ||
-		    le32_to_cpu(gdp->bg_inode_bitmap) >=
+		if (EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)) <
+				block ||
+		    EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)) >=
 				block + EXT3_BLOCKS_PER_GROUP(sb))
 		{
 			ext3_error (sb, "ext3_check_descriptors",
 				    "Inode bitmap for group %d"
 				    " not in group (block %lu)!",
 				    i, (unsigned long)
-					le32_to_cpu(gdp->bg_inode_bitmap));
+					EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)));
 			return 0;
 		}
-		if (le32_to_cpu(gdp->bg_inode_table) < block ||
-		    le32_to_cpu(gdp->bg_inode_table) + sbi->s_itb_per_group >=
-		    block + EXT3_BLOCKS_PER_GROUP(sb))
+		if (EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)) <
+			block ||
+		    EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)) +
+							sbi->s_itb_per_group >=
+			block + EXT3_BLOCKS_PER_GROUP(sb))
 		{
 			ext3_error (sb, "ext3_check_descriptors",
 				    "Inode table for group %d"
 				    " not in group (block %lu)!",
 				    i, (unsigned long)
-					le32_to_cpu(gdp->bg_inode_table));
+					EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)));
 			return 0;
 		}
 		block += EXT3_BLOCKS_PER_GROUP(sb);
 		gdp++;
 	}
 
-	sbi->s_es->s_free_blocks_count=cpu_to_le32(ext3_count_free_blocks(sb));
+	EXT3_FREE_BLOCKS_COUNT_SET(sbi->s_es, ext3_count_free_blocks(sb));
 	sbi->s_es->s_free_inodes_count=cpu_to_le32(ext3_count_free_inodes(sb));
 	return 1;
 }
@@ -1365,6 +1369,7 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 	__le32 features;
+	__u64 blocks_count;
 
 	sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1575,7 +1580,7 @@ static int ext3_fill_super (struct super
 		goto failed_mount;
 	}
 
-	if (le32_to_cpu(es->s_blocks_count) >
+	if (EXT3_BLOCKS_COUNT(es) >
 		    (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) {
 		printk(KERN_ERR "EXT3-fs: filesystem on %s:"
 			" too large to mount safely\n", sb->s_id);
@@ -1587,10 +1592,11 @@ static int ext3_fill_super (struct super
 
 	if (EXT3_BLOCKS_PER_GROUP(sb) == 0)
 		goto cantfind_ext3;
-	sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) -
-			       le32_to_cpu(es->s_first_data_block) +
-			       EXT3_BLOCKS_PER_GROUP(sb) - 1) /
-			      EXT3_BLOCKS_PER_GROUP(sb);
+	blocks_count = (EXT3_BLOCKS_COUNT(es) -
+			le32_to_cpu(es->s_first_data_block) +
+			EXT3_BLOCKS_PER_GROUP(sb) - 1);
+	do_div(blocks_count, EXT3_BLOCKS_PER_GROUP(sb));
+	sbi->s_groups_count = blocks_count;
 	db_count = (sbi->s_groups_count + EXT3_DESC_PER_BLOCK(sb) - 1) /
 		   EXT3_DESC_PER_BLOCK(sb);
 	sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
@@ -1904,7 +1910,7 @@ static journal_t *ext3_get_dev_journal(s
 		goto out_bdev;
 	}
 
-	len = le32_to_cpu(es->s_blocks_count);
+	len = EXT3_BLOCKS_COUNT(es);
 	start = sb_block + 1;
 	brelse(bh);	/* we're done with the superblock */
 
@@ -2074,7 +2080,7 @@ static void ext3_commit_super (struct su
 	if (!sbh)
 		return;
 	es->s_wtime = cpu_to_le32(get_seconds());
-	es->s_free_blocks_count = cpu_to_le32(ext3_count_free_blocks(sb));
+	EXT3_FREE_BLOCKS_COUNT_SET(es, ext3_count_free_blocks(sb));
 	es->s_free_inodes_count = cpu_to_le32(ext3_count_free_inodes(sb));
 	BUFFER_TRACE(sbh, "marking dirty");
 	mark_buffer_dirty(sbh);
@@ -2267,7 +2273,7 @@ static int ext3_remount (struct super_bl
 	ext3_init_journal_params(sb, sbi->s_journal);
 
 	if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY) ||
-		n_blocks_count > le32_to_cpu(es->s_blocks_count)) {
+		n_blocks_count > EXT3_BLOCKS_COUNT(es)) {
 		if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) {
 			err = -EROFS;
 			goto restore_opts;
@@ -2388,10 +2394,10 @@ static int ext3_statfs (struct dentry * 
 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
-	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+	buf->f_blocks = EXT3_BLOCKS_COUNT(es) - overhead;
 	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
-	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
-	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
+	buf->f_bavail = buf->f_bfree - EXT3_R_BLOCKS_COUNT(es);
+	if (buf->f_bfree < EXT3_R_BLOCKS_COUNT(es))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
 	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
diff -puN include/linux/ext3_fs.h~64bit-metadata include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~64bit-metadata	2006-06-28 16:47:14.252935980 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 16:47:14.273933570 -0700
@@ -136,6 +136,54 @@ struct ext3_group_desc
 	__le32	bg_reserved[3];
 };
 
+static inline u32 EXT3_RELATIVE_ENCODE(ext3_fsblk_t group_base,
+				       ext3_fsblk_t fs_block)
+{
+	s32 gdp_block;
+
+	if (fs_block < (1ULL<<32) && group_base < (1ULL<<32))
+		return fs_block;
+
+	gdp_block = (fs_block - group_base);
+	BUG_ON ((group_base + gdp_block) != fs_block);
+
+	return gdp_block;
+}
+
+static inline ext3_fsblk_t EXT3_RELATIVE_DECODE(ext3_fsblk_t group_base,
+						u32 gdp_block)
+{
+	if (group_base >= (1ULL<<32))
+		return group_base + (s32) gdp_block;
+
+	if ((s32) gdp_block >= 0 && gdp_block < group_base &&
+		  group_base + gdp_block >= (1ULL<<32))
+		return group_base + gdp_block;
+
+	return gdp_block;
+}
+
+#define EXT3_BLOCK_BITMAP(bg, group_base)	\
+		EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_block_bitmap))
+#define EXT3_INODE_BITMAP(bg, group_base)	\
+		EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_inode_bitmap))
+#define EXT3_INODE_TABLE(bg, group_base)	\
+		EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_inode_table))
+
+#define EXT3_BLOCK_BITMAP_SET(bg, group_base, value)	\
+	do {(bg)->bg_block_bitmap = cpu_to_le32(EXT3_RELATIVE_ENCODE(group_base, value));} while(0)
+#define EXT3_INODE_BITMAP_SET(bg, group_base, value)	\
+	do {(bg)->bg_inode_bitmap = cpu_to_le32(EXT3_RELATIVE_ENCODE(group_base, value));} while(0)
+#define EXT3_INODE_TABLE_SET(bg, group_base, value)	\
+	do {(bg)->bg_inode_table = cpu_to_le32(EXT3_RELATIVE_ENCODE(group_base, value));} while(0)
+
+#define EXT3_IS_USED_BLOCK_BITMAP(bg)	\
+	((bg)->bg_block_bitmap != 0)
+#define EXT3_IS_USED_INODE_BITMAP(bg)	\
+	((bg)->bg_inode_bitmap != 0)
+#define EXT3_IS_USED_INODE_TABLE(bg)	\
+	((bg)->bg_inode_table != 0)
+
 /*
  * Macro-instructions used to manage group descriptors
  */
@@ -483,9 +531,38 @@ struct ext3_super_block {
 	__u16	s_reserved_word_pad;
 	__le32	s_default_mount_opts;
 	__le32	s_first_meta_bg; 	/* First metablock block group */
-	__u32	s_reserved[190];	/* Padding to the end of the block */
+	/* 64bit support valid if EXT3_FEATURE_COMPAT_64BIT */
+	__le32	s_blocks_count_hi;	/* Blocks count */
+/*100*/	__le32	s_r_blocks_count_hi;	/* Reserved blocks count */
+	__le32	s_free_blocks_count_hi;	/* Free blocks count */
+	__u32	s_reserved[187];	/* Padding to the end of the block */
 };
 
+
+#define EXT3_BLOCKS_COUNT(s)	\
+	(ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_blocks_count_hi) << 32) |	\
+	 	(__u64)le32_to_cpu((s)->s_blocks_count))
+#define EXT3_BLOCKS_COUNT_SET(s,v)	do {				\
+	(s)->s_blocks_count = cpu_to_le32((v));				\
+	(s)->s_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32);	\
+} while (0)
+
+#define EXT3_R_BLOCKS_COUNT(s)	\
+	(ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_r_blocks_count_hi) << 32) |	\
+		 (__u64)le32_to_cpu((s)->s_r_blocks_count))
+#define EXT3_R_BLOCKS_COUNT_SET(s,v)	do {				\
+	(s)->s_r_blocks_count = cpu_to_le32((v));			\
+	(s)->s_r_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32);	\
+} while (0)
+
+#define EXT3_FREE_BLOCKS_COUNT(s)					\
+	(ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_free_blocks_count_hi) << 32) | \
+		 (__u64)le32_to_cpu((s)->s_free_blocks_count))
+#define EXT3_FREE_BLOCKS_COUNT_SET(s,v)	do {				\
+	(s)->s_free_blocks_count = cpu_to_le32((v));			\
+	(s)->s_free_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32);	\
+} while (0)
+
 #ifdef __KERNEL__
 #include <linux/ext3_fs_i.h>
 #include <linux/ext3_fs_sb.h>
@@ -559,6 +636,7 @@ static inline struct ext3_inode_info *EX
 #define EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT3_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
 #define EXT3_FEATURE_RO_COMPAT_BTREE_DIR	0x0004
+#define EXT3_FEATURE_RO_COMPAT_64BIT		0x0010
 
 #define EXT3_FEATURE_INCOMPAT_COMPRESSION	0x0001
 #define EXT3_FEATURE_INCOMPAT_FILETYPE		0x0002
@@ -574,7 +652,8 @@ static inline struct ext3_inode_info *EX
 					 EXT3_FEATURE_INCOMPAT_EXTENTS)
 #define EXT3_FEATURE_RO_COMPAT_SUPP	(EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
-					 EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
+					 EXT3_FEATURE_RO_COMPAT_BTREE_DIR| \
+					 EXT3_FEATURE_RO_COMPAT_64BIT)
 
 /*
  * Default values for user and/or group using reserved blocks

_
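
For readers following the superblock macros in the patch above, here is a
minimal userspace sketch of how a 64-bit count is split across a low and a
high 32-bit field and put back together. The struct and function names are
hypothetical, and the cpu_to_le32()/le32_to_cpu() conversions are left out,
so this only models the arithmetic on a host-order copy of the fields.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two superblock fields holding one count. */
struct sb_count_sketch {
	uint32_t blocks_count;		/* low 32 bits */
	uint32_t blocks_count_hi;	/* high 32 bits */
};

/* Reassemble the full count, widening before the shift. */
uint64_t sketch_blocks_count(const struct sb_count_sketch *s)
{
	return ((uint64_t)s->blocks_count_hi << 32) | (uint64_t)s->blocks_count;
}

/* Store the count: low half truncates, high half is shifted down. */
void sketch_blocks_count_set(struct sb_count_sketch *s, uint64_t v)
{
	s->blocks_count = (uint32_t)v;
	s->blocks_count_hi = (uint32_t)(v >> 32);
}
```

The cast to a 64-bit type before ">> 32" matters: shifting a 32-bit value by
32 is undefined in C, which is the same issue the revoke-code fix in this
series addresses.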





* [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (17 preceding siblings ...)
  2006-06-30  0:19 ` [RFC][Update][Patch 14/16] 48bit super block (metadata) changes Mingming Cao
@ 2006-06-30  0:19 ` Mingming Cao
  2006-06-30  0:19 ` [RFC][Update][Patch 16/16]Update ext3 superblock definition Mingming Cao
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

Change the 64bit feature flag from RO_COMPAT to INCOMPAT, and fix a compile
warning in the 64bit_metadata patch.
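
As a reminder of what the two flag classes mean at mount time, here is a
minimal sketch of the usual ext2/ext3 feature check. The flag values are
taken from the header in this patch; the enum and function names are
hypothetical and are not the kernel's actual mount code.

```c
#include <stdint.h>

#define SK_RO_COMPAT_SPARSE_SUPER	0x0001
#define SK_RO_COMPAT_LARGE_FILE		0x0002
#define SK_RO_COMPAT_BTREE_DIR		0x0004

#define SK_INCOMPAT_FILETYPE		0x0002
#define SK_INCOMPAT_RECOVER		0x0004
#define SK_INCOMPAT_META_BG		0x0010
#define SK_INCOMPAT_EXTENTS		0x0040
#define SK_INCOMPAT_64BIT		0x0080

enum sk_mount_result { SK_MOUNT_RW, SK_MOUNT_RO_ONLY, SK_MOUNT_REFUSED };

/*
 * An unknown INCOMPAT bit refuses the mount outright; an unknown
 * RO_COMPAT bit still allows a read-only mount.
 */
enum sk_mount_result sk_check_features(uint32_t incompat, uint32_t ro_compat,
				       uint32_t incompat_supp,
				       uint32_t ro_compat_supp)
{
	if (incompat & ~incompat_supp)
		return SK_MOUNT_REFUSED;
	if (ro_compat & ~ro_compat_supp)
		return SK_MOUNT_RO_ONLY;
	return SK_MOUNT_RW;
}
```

Under this scheme a kernel that does not know the 64bit flag would still
mount the filesystem read-only if the flag were RO_COMPAT, but refuses the
mount once it is INCOMPAT.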

Signed-off-by: Mingming Cao <cmm@us.ibm.com>


---

 linux-2.6.17-ming/include/linux/ext3_fs.h |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff -puN include/linux/ext3_fs.h~64bit-incompat-flag-change include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~64bit-incompat-flag-change	2006-06-28 16:47:16.224709734 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 16:49:17.471797610 -0700
@@ -136,6 +136,9 @@ struct ext3_group_desc
 	__le32	bg_reserved[3];
 };
 
+#ifdef __KERNEL__
+#include <linux/ext3_fs_i.h>
+#include <linux/ext3_fs_sb.h>
 static inline u32 EXT3_RELATIVE_ENCODE(ext3_fsblk_t group_base,
 				       ext3_fsblk_t fs_block)
 {
@@ -183,7 +186,7 @@ static inline ext3_fsblk_t EXT3_RELATIVE
 	((bg)->bg_inode_bitmap != 0)
 #define EXT3_IS_USED_INODE_TABLE(bg)	\
 	((bg)->bg_inode_table != 0)
-
+#endif
 /*
  * Macro-instructions used to manage group descriptors
  */
@@ -564,8 +567,6 @@ struct ext3_super_block {
 } while (0)
 
 #ifdef __KERNEL__
-#include <linux/ext3_fs_i.h>
-#include <linux/ext3_fs_sb.h>
 static inline struct ext3_sb_info * EXT3_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
@@ -636,7 +637,6 @@ static inline struct ext3_inode_info *EX
 #define EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT3_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
 #define EXT3_FEATURE_RO_COMPAT_BTREE_DIR	0x0004
-#define EXT3_FEATURE_RO_COMPAT_64BIT		0x0010
 
 #define EXT3_FEATURE_INCOMPAT_COMPRESSION	0x0001
 #define EXT3_FEATURE_INCOMPAT_FILETYPE		0x0002
@@ -644,16 +644,17 @@ static inline struct ext3_inode_info *EX
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
 #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
 #define EXT3_FEATURE_INCOMPAT_EXTENTS		0x0040 /* extents support */
+#define EXT3_FEATURE_INCOMPAT_64BIT		0x0080
 
 #define EXT3_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT3_FEATURE_INCOMPAT_SUPP	(EXT3_FEATURE_INCOMPAT_FILETYPE| \
 					 EXT3_FEATURE_INCOMPAT_RECOVER| \
 					 EXT3_FEATURE_INCOMPAT_META_BG| \
-					 EXT3_FEATURE_INCOMPAT_EXTENTS)
+					 EXT3_FEATURE_INCOMPAT_EXTENTS| \
+					 EXT3_FEATURE_INCOMPAT_64BIT)
 #define EXT3_FEATURE_RO_COMPAT_SUPP	(EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
-					 EXT3_FEATURE_RO_COMPAT_BTREE_DIR| \
-					 EXT3_FEATURE_RO_COMPAT_64BIT)
+					 EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
 
 /*
  * Default values for user and/or group using reserved blocks

_





* [RFC][Update][Patch 16/16]Update ext3 superblock definition
  2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
                   ` (18 preceding siblings ...)
  2006-06-30  0:19 ` [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature Mingming Cao
@ 2006-06-30  0:19 ` Mingming Cao
  19 siblings, 0 replies; 296+ messages in thread
From: Mingming Cao @ 2006-06-30  0:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, ext2-devel

The ext3 on-disk superblock definition in the kernel is lagging
behind some e2fsprogs-only fields (the backup of the journal inode,
and the mkfs timestamp), leading to the high bits of the fs size
being declared in a field already reserved by e2fsprogs.  Bring them
back in sync.
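
The resulting layout can be checked with a small compile-time sketch. It
assumes the /*100*/ and /*150*/ markers in the diff below are hex byte
offsets and that the superblock is 1024 bytes; the struct name and the
opaque "head" padding for the first 0x100 bytes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct sb_tail_sketch {
	uint8_t  head[0x100];		/* fields before s_default_mount_opts, elided */
	uint32_t s_default_mount_opts;	/* 0x100 */
	uint32_t s_first_meta_bg;	/* 0x104 */
	uint32_t s_mkfs_time;		/* 0x108 */
	uint32_t s_jnl_blocks[17];	/* 0x10c .. 0x14f */
	uint32_t s_blocks_count_hi;	/* 0x150 */
	uint32_t s_r_blocks_count_hi;	/* 0x154 */
	uint32_t s_free_blocks_count_hi;/* 0x158 */
	uint32_t s_reserved[169];	/* 0x15c .. 0x3ff */
};

_Static_assert(offsetof(struct sb_tail_sketch, s_default_mount_opts) == 0x100,
	       "s_default_mount_opts must sit at 0x100");
_Static_assert(offsetof(struct sb_tail_sketch, s_blocks_count_hi) == 0x150,
	       "s_blocks_count_hi must sit at 0x150");
_Static_assert(sizeof(struct sb_tail_sketch) == 1024,
	       "superblock sketch must stay 1024 bytes");
```

Without the two e2fsprogs fields, s_blocks_count_hi would land at 0x108,
which is where s_mkfs_time lives in this layout.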

Signed-off-by: Stephen Tweedie <sct@redhat.com>


---

 linux-2.6.17-ming/include/linux/ext3_fs.h |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff -puN include/linux/ext3_fs.h~ext3-sb-struc-sync-with-e2fsprog include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3-sb-struc-sync-with-e2fsprog	2006-06-28 16:47:18.377462723 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h	2006-06-28 16:47:18.381462264 -0700
@@ -532,13 +532,15 @@ struct ext3_super_block {
 	__u8	s_def_hash_version;	/* Default hash version to use */
 	__u8	s_reserved_char_pad;
 	__u16	s_reserved_word_pad;
-	__le32	s_default_mount_opts;
+/*100*/	__le32	s_default_mount_opts;
 	__le32	s_first_meta_bg; 	/* First metablock block group */
+	__le32	s_mkfs_time;		/* When the filesystem was created */
+	__le32	s_jnl_blocks[17]; 	/* Backup of the journal inode */
 	/* 64bit support valid if EXT3_FEATURE_COMPAT_64BIT */
-	__le32	s_blocks_count_hi;	/* Blocks count */
-/*100*/	__le32	s_r_blocks_count_hi;	/* Reserved blocks count */
+/*150*/	__le32	s_blocks_count_hi;	/* Blocks count */
+	__le32	s_r_blocks_count_hi;	/* Reserved blocks count */
 	__le32	s_free_blocks_count_hi;	/* Free blocks count */
-	__u32	s_reserved[187];	/* Padding to the end of the block */
+	__u32	s_reserved[169];	/* Padding to the end of the block */
 };
 
 

_





* Re: [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code
  2006-06-30  0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
@ 2006-06-30  3:15   ` H. Peter Anvin
  0 siblings, 0 replies; 296+ messages in thread
From: H. Peter Anvin @ 2006-06-30  3:15 UTC (permalink / raw)
  To: cmm; +Cc: linux-kernel, ext2-devel, linux-fsdevel

Mingming Cao wrote:
> "val >> 32" is undefined if val is a 32-bit value, so this code is
> broken if CONFIG_LBD is not set.  Make it safe for that case.
> 
> Signed-off-by: Stephen Tweedie <sct@redhat.com>
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> 
> 
> ---
> 
>  linux-2.6.17-ming/fs/jbd/revoke.c |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff -puN fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix fs/jbd/revoke.c
> --- linux-2.6.17/fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix	2006-06-28 16:47:09.695458913 -0700
> +++ linux-2.6.17-ming/fs/jbd/revoke.c	2006-06-28 16:47:09.699458454 -0700
> @@ -110,7 +110,7 @@ static inline int hash(journal_t *journa
>  {
>  	struct jbd_revoke_table_s *table = journal->j_revoke;
>  	int hash_shift = table->hash_shift;
> -	int hash = (int)block ^ (int)(block >> 32);
> +	int hash = (int)block ^ (int)((block >> 31) >> 1);
>  

It might be better to code it as:

	(int)((u64)block >> 32)

... which gcc can trivially recognize as 0 if block is 32 bits.  Not 
sure if it can do that with the code above.
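
A minimal sketch of the two forms side by side, for a 32-bit and a 64-bit
block number. The function names are hypothetical and the block type is
modelled with fixed-width integers rather than sector_t; whether gcc folds
each form to a constant is not something this sketch demonstrates.

```c
#include <stdint.h>

/* Patch form on a 32-bit block number: two shifts, each below the type width. */
int fold_double_shift32(uint32_t block)
{
	return (int)block ^ (int)((block >> 31) >> 1);
}

/* Widening form on a 32-bit block number: the high half is always 0. */
int fold_widen32(uint32_t block)
{
	return (int)block ^ (int)((uint64_t)block >> 32);
}

/* Patch form on a 64-bit block number. */
int fold_double_shift64(uint64_t block)
{
	return (int)block ^ (int)((block >> 31) >> 1);
}

/* Widening form on a 64-bit block number (the cast is a no-op here). */
int fold_widen64(uint64_t block)
{
	return (int)block ^ (int)((uint64_t)block >> 32);
}
```

Both forms avoid shifting a 32-bit value by its full width and give the same
result for either block size.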

	-hpa


end of thread, other threads:[~2006-06-30  3:15 UTC | newest]

Thread overview: 296+ messages
2006-06-09  1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
2006-06-09  2:40 ` Valdis.Kletnieks
2006-06-09  8:20   ` Andreas Dilger
2006-06-09 18:35     ` [Ext2-devel] " Stephen C. Tweedie
2006-06-09 19:20       ` Jeff Garzik
2006-06-09 19:28         ` Alex Tomas
2006-06-09 19:32           ` Jeff Garzik
2006-06-09 19:41             ` Alex Tomas
2006-06-09 15:23   ` Mingming Cao
2006-06-09  2:49 ` Jeff Garzik
2006-06-09  8:35   ` Andreas Dilger
2006-06-09 15:08     ` Jeff Garzik
2006-06-09 15:25       ` Jeff Garzik
2006-06-09 15:40         ` Linus Torvalds
2006-06-09 15:47           ` Jeff Garzik
2006-06-09 15:55             ` Alex Tomas
2006-06-09 15:56               ` Jeff Garzik
2006-06-09 16:07                 ` Alex Tomas
2006-06-09 16:09                   ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:04                   ` Matthew Frost
2006-06-09 18:10                     ` Alex Tomas
2006-06-09 18:14                     ` [Ext2-devel] " Andreas Dilger
2006-06-09 18:51                       ` Jeff Garzik
2006-06-09 19:39                         ` Gerrit Huizenga
2006-06-09 19:45                           ` [Ext2-devel] " Jeff Garzik
2006-06-09 20:38                             ` Gerrit Huizenga
2006-06-10 10:03                           ` Christoph Hellwig
2006-06-09 19:49                         ` [Ext2-devel] " Theodore Tso
2006-06-09 20:04                           ` Jeff Garzik
2006-06-09 20:57                             ` Stephen C. Tweedie
2006-06-09 21:49                               ` Jeff Garzik
2006-06-09 21:55                                 ` [Ext2-devel] " Stephen C. Tweedie
2006-06-09 23:44                                   ` Jeff Garzik
2006-06-10  0:45                                     ` [Ext2-devel] " Andreas Dilger
2006-06-10  0:47                                     ` Theodore Tso
2006-06-10  1:09                                       ` Jeff Garzik
2006-06-10  1:30                                         ` [Ext2-devel] " Andreas Dilger
2006-06-10  1:43                                           ` Jeff Garzik
2006-06-10  2:03                                             ` Theodore Tso
2006-06-10  2:11                                               ` [Ext2-devel] " Jeff Garzik
2006-06-10  2:54                                                 ` Theodore Tso
2006-06-10  3:11                                                   ` Jeff Garzik
2006-06-10 12:15                                                     ` Theodore Tso
2006-06-10 14:31                                                       ` Jeff Garzik
2006-06-10  2:58                                               ` [Ext2-devel] " Jeff Garzik
2006-06-10  2:26                                             ` Andreas Dilger
2006-06-10  2:31                                               ` Jeff Garzik
2006-06-10  4:22                                                 ` Andreas Dilger
2006-06-09 22:37                             ` [Ext2-devel] " Andreas Dilger
2006-06-11 16:02                         ` Arjan van de Ven
2006-06-11 16:30                           ` Nikita Danilov
2006-06-11 16:55                             ` [Ext2-devel] " Arjan van de Ven
2006-06-12  6:35                           ` Andreas Dilger
2006-06-12 22:06                         ` [Ext2-devel] " Pavel Machek
2006-06-14 14:31                           ` Barry K. Nathan
2006-06-14 21:34                             ` [Ext2-devel] " Pavel Machek
2006-06-15  0:28                               ` Barry K. Nathan
2006-06-15  4:55                                 ` Theodore Tso
2006-06-15  7:43                                   ` Barry K. Nathan
2006-06-15  9:15                                 ` Pavel Machek
2006-06-15  9:40                                   ` Barry K. Nathan
2006-06-15  9:50                                     ` [Ext2-devel] " Pavel Machek
2006-06-09 20:52                 ` Stephen C. Tweedie
2006-06-09 21:47                   ` [Ext2-devel] " Jeff Garzik
2006-06-10  0:41                     ` James Morris
2006-06-09 16:01             ` Linus Torvalds
2006-06-09 20:38             ` Stephen C. Tweedie
2006-06-09 15:57           ` Jeff Garzik
2006-06-09 16:10           ` [Ext2-devel] " Alex Tomas
2006-06-09 16:10             ` Jeff Garzik
2006-06-09 16:24               ` Erik Mouw
2006-06-09 16:28                 ` Jeff Garzik
2006-06-09 16:24               ` [Ext2-devel] " Chase Venters
2006-06-09 16:25               ` Alex Tomas
2006-06-09 16:28                 ` Jeff Garzik
2006-06-09 16:50                   ` Alex Tomas
2006-06-09 16:53                     ` [Ext2-devel] " Jeff Garzik
2006-06-09 17:01                       ` Alex Tomas
2006-06-09 17:10                         ` Jeff Garzik
2006-06-09 16:25             ` Linus Torvalds
2006-06-09 16:48               ` Alex Tomas
2006-06-09 16:54                 ` KELEMEN Peter
2006-06-09 16:55                 ` Jeff Garzik
2006-06-09 17:12                   ` [Ext2-devel] " Alex Tomas
2006-06-09 17:12                     ` Jeff Garzik
2006-06-09 19:57                   ` Theodore Tso
2006-06-09 20:09                     ` Jeff Garzik
2006-06-09 20:14                       ` Alex Tomas
2006-06-09 20:28                         ` Jeff Garzik
2006-06-19  7:48                         ` [Ext2-devel] " Helge Hafting
2006-06-09 20:38                     ` Joel Becker
2006-06-09 20:50                       ` Dave Jones
2006-06-09 21:09                         ` Joel Becker
2006-06-09 21:51                           ` Mike Snitzer
2006-06-09 21:32                         ` [Ext2-devel] " Jeff Garzik
2006-06-09 22:56                           ` Andreas Dilger
2006-06-09 23:06                             ` Linus Torvalds
2006-06-09 23:09                             ` Jeff Garzik
2006-06-09 23:37                               ` [Ext2-devel] " Andreas Dilger
2006-06-09 23:54                                 ` Linus Torvalds
2006-06-09 21:03                       ` Theodore Tso
2006-06-09 21:24                         ` Joel Becker
2006-06-09 21:36                           ` [Ext2-devel] " Chase Venters
2006-06-09 21:51                           ` Theodore Tso
2006-06-09 22:07                             ` Joel Becker
2006-06-09 22:31                               ` [Ext2-devel] " Theodore Tso
2006-06-09 22:47                                 ` Joel Becker
2006-06-09 23:54                                   ` [Ext2-devel] " Theodore Tso
2006-06-09 23:48                         ` Jeff Garzik
2006-06-12  8:58                     ` Jes Sorensen
2006-06-10  0:07                   ` Olivier Galibert
2006-06-10  0:13                     ` Jeff Garzik
2006-06-09 16:54               ` [Ext2-devel] " Linus Torvalds
2006-06-09 17:04                 ` Alex Tomas
2006-06-09 17:30                   ` [Ext2-devel] " Linus Torvalds
2006-06-09 17:41                     ` Matthew Wilcox
2006-06-09 17:50                       ` Jeff Garzik
2006-06-09 18:00                         ` Alex Tomas
2006-06-09 18:04                       ` [Ext2-devel] " Linus Torvalds
2006-06-09 18:17                       ` Michael Poole
2006-06-09 17:44                 ` Theodore Tso
2006-06-09 17:58                   ` Jeff Garzik
2006-06-09 18:10                 ` [Ext2-devel] " Andreas Dilger
2006-06-09 18:22                   ` Linus Torvalds
2006-06-09 18:30                     ` Alex Tomas
2006-06-09 18:38                       ` Linus Torvalds
2006-06-09 18:50                         ` [Ext2-devel] " Chase Venters
2006-06-09 19:00                           ` Chase Venters
2006-06-10 13:33                             ` Adrian Bunk
2006-06-09 19:01                           ` Jeff Garzik
2006-06-10 19:27                             ` Kyle Moffett
2006-06-10 19:44                               ` Linus Torvalds
2006-06-10 20:02                                 ` [Ext2-devel] " Linus Torvalds
2006-06-10 21:26                                   ` Theodore Tso
2006-06-10 21:31                                     ` Linus Torvalds
2006-06-10 22:12                                     ` Jeff Garzik
2006-06-10 22:21                                     ` Jeff Garzik
2006-06-11  4:39                                       ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
2006-06-11  5:19                                         ` Stable/devel policy - was " Linus Torvalds
2006-06-11  7:32                                           ` Ingo Molnar
2006-06-13  0:28                                         ` Stable/devel policy - was Re: [Ext2-devel] " Mingming Cao
2006-06-09 19:21                           ` Alan Cox
2006-06-09 19:13                             ` [Ext2-devel] " Chase Venters
2006-06-09 19:24                             ` Alex Tomas
2006-06-09 19:25                               ` Jeff Garzik
2006-06-09 19:35                                 ` Alex Tomas
2006-06-09 19:35                                   ` [Ext2-devel] " Jeff Garzik
2006-06-09 20:44                                   ` Joel Becker
2006-06-09 20:49                                     ` Alex Tomas
2006-06-09 21:11                                       ` Joel Becker
2006-06-09 21:20                                         ` Alex Tomas
2006-06-09 21:29                                           ` Joel Becker
2006-06-09 21:33                                             ` Alex Tomas
2006-06-09 21:43                                             ` Joel Becker
2006-06-11 20:14                                   ` [Ext2-devel] " grundig
2006-06-14 16:45                                     ` Alex Tomas
2006-06-09 19:22                         ` Alex Tomas
2006-06-09 19:22                           ` Jeff Garzik
2006-06-09 20:16                         ` Andreas Dilger
2006-06-09 20:31                           ` Linus Torvalds
2006-06-09 20:31                           ` Jeff Garzik
2006-06-09 18:43                       ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:50                       ` Diego Calleja
2006-06-09 19:08                         ` Diego Calleja
2006-06-09 18:40                   ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:59                     ` Andrew Morton
2006-06-09 19:16                       ` Jeff Garzik
2006-06-09 20:27                         ` [Ext2-devel] " Chase Venters
2006-06-09 20:44                       ` Alan Cox
2006-06-11 15:52                         ` [Ext2-devel] " Arjan van de Ven
2006-06-09 18:41                   ` Jeff Garzik
2006-06-09 17:12               ` Jeff Anderson-Lee
2006-06-09 18:02               ` Andrew Morton
2006-06-10 19:10         ` Kyle Moffett
2006-06-10 19:27           ` Linus Torvalds
2006-06-09 15:28       ` [Ext2-devel] " Alex Tomas
2006-06-09 15:31         ` Matthew Wilcox
2006-06-10  3:26           ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
2006-06-10  5:25             ` Andreas Dilger
2006-06-10  5:41               ` Valerie Henson
2006-06-10  6:22                 ` Andreas Dilger
2006-06-10 14:22             ` Jeff Garzik
2006-06-09 15:44         ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
2006-06-09 15:53           ` Alex Tomas
2006-06-09 15:52             ` Jeff Garzik
2006-06-09 16:02               ` Alex Tomas
2006-06-09 16:04                 ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:29           ` Andreas Dilger
2006-06-09 15:53         ` [Ext2-devel] " Gerrit Huizenga
2006-06-09 16:03           ` Jeff Garzik
2006-06-09 16:09           ` Linus Torvalds
2006-06-09 17:58             ` Gerrit Huizenga
2006-06-09 18:25               ` [Ext2-devel] " Chase Venters
2006-06-10 13:46               ` Adrian Bunk
2006-06-10 14:42                 ` Ingo Molnar
2006-06-10 15:03                   ` Jeff Garzik
2006-06-11  6:00                     ` Ingo Molnar
2006-06-10 16:00                   ` Adrian Bunk
2006-06-10 16:05                   ` Christoph Hellwig
2006-06-10 23:05                   ` Mike Galbraith
2006-06-13 13:34               ` [Ext2-devel] " Helge Hafting
2006-06-09 20:32       ` Stephen C. Tweedie
2006-06-09 20:46         ` Linus Torvalds
2006-06-09 20:56           ` Alex Tomas
2006-06-20  6:15           ` [Ext2-devel] " Qi Yong
2006-06-20  8:26             ` Laurent Vivier
2006-06-20  8:30               ` Jeff Garzik
2006-06-20  9:21                 ` Laurent Vivier
2006-06-20  9:48                   ` Jeff Garzik
2006-06-20 10:40                     ` Laurent Vivier
2006-06-09 17:14   ` Alan Cox
2006-06-09  9:13 ` Christoph Hellwig
2006-06-09 10:07   ` Andrew Morton
2006-06-09 15:40     ` Jeff Garzik
2006-06-09 15:42       ` Matthew Wilcox
2006-06-09 15:51         ` Jeff Garzik
2006-06-09 17:29         ` Alan Cox
2006-06-09 16:56       ` Andrew Morton
2006-06-09 17:07         ` Jeff Garzik
2006-06-09 17:35           ` Andrew Morton
2006-06-09 17:48             ` Jeff Garzik
2006-06-09 17:59               ` Jeff Garzik
2006-06-09 18:27                 ` [Ext2-devel] " Mike Snitzer
2006-06-09 18:54                   ` Jeff Garzik
2006-06-09 19:22                     ` Alex Tomas
2006-06-09 19:23                       ` Jeff Garzik
2006-06-09 22:49                       ` Valdis.Kletnieks
2006-06-09 23:34                         ` [Ext2-devel] " Andreas Dilger
2006-06-10 13:49                   ` Adrian Bunk
2006-06-10 13:51                     ` Christoph Hellwig
2006-06-10 14:54                       ` Jeff Garzik
2006-06-10 18:01                         ` [Ext2-devel] " Andreas Dilger
2006-06-09 21:42             ` Sonny Rao
2006-06-09 22:15               ` Andrew Morton
2006-06-09 23:11                 ` Andreas Dilger
2006-06-09 23:15                   ` Jeff Garzik
2006-06-10  3:37                   ` Valerie Henson
2006-06-10  3:49                 ` Nathan Scott
2006-06-09 18:23       ` Michael Poole
2006-06-09 18:55         ` Jeff Garzik
2006-06-09 19:42           ` [Ext2-devel] " Gerrit Huizenga
2006-06-09 20:00             ` Jeff Garzik
2006-06-09 20:08               ` Alex Tomas
2006-06-09 20:10                 ` [Ext2-devel] " Jeff Garzik
2006-06-09 20:35               ` Theodore Tso
2006-06-09 21:41                 ` Jeff Garzik
2006-06-09 21:45                   ` [Ext2-devel] " Michael Poole
2006-06-09 21:53                     ` Jeff Garzik
2006-06-09 22:04                       ` Theodore Tso
2006-06-10  0:49         ` Sven-Haegar Koch
2006-06-10  1:06           ` Theodore Tso
2006-06-10 14:07             ` Olivier Galibert
2006-06-10 19:52               ` Theodore Tso
2006-06-09 10:49   ` Andreas Dilger
2006-06-09 11:26   ` Alex Tomas
2006-06-09 14:23     ` [Ext2-devel] " Jeff Garzik
2006-06-09 14:33       ` Alex Tomas
2006-06-09 14:34       ` Alex Tomas
2006-06-09 14:35         ` Jeff Garzik
2006-06-09 14:57           ` Alex Tomas
2006-06-09 15:17             ` [Ext2-devel] " Jeff Garzik
2006-06-09 16:21               ` Mike Snitzer
2006-06-09 16:27                 ` Jeff Garzik
2006-06-09 16:48                   ` Alex Tomas
2006-06-09 16:51                     ` Jeff Garzik
2006-06-09 16:33                 ` Alex Tomas
2006-06-09 16:37                   ` [Ext2-devel] " Jeff Garzik
2006-06-09 22:52                   ` Valdis.Kletnieks
2006-06-09 23:21                     ` Andreas Dilger
2006-06-10  1:21                       ` Valdis.Kletnieks
2006-06-10  2:09                         ` [Ext2-devel] " Andreas Dilger
2006-06-10  2:45                           ` Nicholas Miell
2006-06-10  4:29                             ` Andreas Dilger
2006-06-09 16:56               ` Andreas Dilger
2006-06-09 17:32                 ` [Ext2-devel] " Greg KH
2006-06-09 18:48                 ` Jeff Garzik
2006-06-30  0:16 ` [RFC][Update 0/16]extents and 48bit ext3/4 patches Mingming Cao
2006-06-30  0:16 ` [RFC][Update][Patch 1/16]core extent map support Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 2/16]sector_t type format string Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 4/16]support 48 bit blk number in extents Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 5/16]block type convert " Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 6/16]handing unitialized extents Mingming Cao
2006-06-30  0:17 ` [RFC][Update][Patch 7/16]Core 64 bit JBD changes Mingming Cao
2006-06-30  0:18 ` [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags Mingming Cao
2006-06-30  0:18 ` [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors Mingming Cao
2006-06-30  0:18 ` [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes() Mingming Cao
2006-06-30  0:18 ` [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes Mingming Cao
2006-06-30  0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
2006-06-30  3:15   ` H. Peter Anvin
2006-06-30  0:18 ` [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support Mingming Cao
2006-06-30  0:19 ` [RFC][Update][Patch 14/16] 48bit super block (metadata) changes Mingming Cao
2006-06-30  0:19 ` [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature Mingming Cao
2006-06-30  0:19 ` [RFC][Update][Patch 16/16]Update ext3 superblock definition Mingming Cao
  -- strict thread matches above, loose matches on Subject: below --
2006-06-11  8:22 [RFC 0/13] extents and 48bit ext3 linux
     [not found] <6lTUf-54A-17@gated-at.bofh.it>
     [not found] ` <6lU3S-5h5-11@gated-at.bofh.it>
     [not found]   ` <6lU3X-5h5-35@gated-at.bofh.it>
     [not found]     ` <6lUnl-5GL-5@gated-at.bofh.it>
     [not found]       ` <6lUwX-66U-25@gated-at.bofh.it>
     [not found]         ` <6lUQo-6w3-29@gated-at.bofh.it>
     [not found]           ` <6lUQp-6w3-35@gated-at.bofh.it>
     [not found]             ` <6lUZT-6HS-3@gated-at.bofh.it>
     [not found]               ` <6nE4Z-4If-55@gated-at.bofh.it>
2006-06-14 16:45                 ` [Ext2-devel] " Bodo Eggert
2006-06-14 17:28                   ` Andreas Dilger
