* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
@ 2006-06-09 2:40 ` Valdis.Kletnieks
2006-06-09 8:20 ` Andreas Dilger
2006-06-09 15:23 ` Mingming Cao
2006-06-09 2:49 ` Jeff Garzik
` (18 subsequent siblings)
19 siblings, 2 replies; 295+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09 2:40 UTC (permalink / raw)
To: cmm; +Cc: linux-kernel, ext2-devel, linux-fsdevel
On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
> Current ext3 filesystem is limited to 8TB(4k block size), this is
> practically not enough for the increasing need of bigger storage as
> disks in a few years (or even now).
>
> To address this need, there are co-effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
> expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is build on top of extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:
which implies matching changes to mkfs.ext2 and possibly mount..
> Appreciate any comments and feedbacks!
Somebody else was recently discussing a set of ext3 patches for
extents+delalloc+mballoc - is this work compatible with that?
Also, a pointer to the matching userspace patches would help anybody
who's gung-ho enough to test the code....
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 2:40 ` Valdis.Kletnieks
@ 2006-06-09 8:20 ` Andreas Dilger
2006-06-09 18:35 ` [Ext2-devel] " Stephen C. Tweedie
2006-06-09 15:23 ` Mingming Cao
1 sibling, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 8:20 UTC (permalink / raw)
To: Valdis.Kletnieks; +Cc: cmm, linux-kernel, ext2-devel, linux-fsdevel
On Jun 08, 2006 22:40 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
> > To address this need, there are co-effort from RedHat, ClusterFS, IBM
> > and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
> > expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> > ext3 is build on top of extent map changes for ext3, originally from
> > Alex Tomas. In short, the new ext3 on-disk extents format is:
>
> which implies matching changes to mkfs.ext2 and possibly mount..
The extents format doesn't need any support from mke2fs. Currently this
is activated by a mount option "-o extents", so it won't be used until
a system administrator actively enables it.
> > Appreciate any comments and feedbacks!
>
> Somebody else was recently discussing a set of patches to ext3 for
> extents+delalloc+mballoc patches - is this work compatible with that?
Yes, completely compatible (author is the same person). We have all been
working to get these improvements into the vanilla kernel so that everyone
can benefit from the improved performance. These patches are just the
start - the mballoc and delalloc patches are follow-on patches, but they
do not affect the on-disk format, just the in-memory implementation of
block allocation.
> Also, a pointer to the matching userspace patches would help anybody
> who's gung-ho enough to test the code....
They were posted to the ext2-devel mailing list previously, or you can
download a patched RPM at ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/
(the extent support is making its way into the official e2fsprogs also).
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 8:20 ` Andreas Dilger
@ 2006-06-09 18:35 ` Stephen C. Tweedie
2006-06-09 19:20 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 18:35 UTC (permalink / raw)
To: Andreas Dilger
Cc: Valdis.Kletnieks, linux-fsdevel, ext2-devel@lists.sourceforge.net,
Mingming Cao, linux-kernel, Stephen Tweedie
Hi,
On Fri, 2006-06-09 at 02:20 -0600, Andreas Dilger wrote:
> > which implies matching changes to mkfs.ext2 and possibly mount..
>
> The extents format doesn't need any support from mke2fs. Currently this
> is activated by a mount option "-o extents", so it won't be used until
> a system administrator actively enables it.
It does need support from e2fsprogs, though; patches have been posted to
ext2-devel and are available on
http://www.bullopensource.org/ext4/index.html
though there is work left to do, especially to improve fsck's ability to
repair partially-damaged extent trees.
--Stephen
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 2:40 ` Valdis.Kletnieks
2006-06-09 8:20 ` Andreas Dilger
@ 2006-06-09 15:23 ` Mingming Cao
1 sibling, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-09 15:23 UTC (permalink / raw)
To: Valdis.Kletnieks; +Cc: linux-fsdevel, ext2-devel, linux-kernel
Valdis.Kletnieks@vt.edu wrote:
> On Thu, 08 Jun 2006 18:20:54 PDT, Mingming Cao said:
>
>>Current ext3 filesystem is limited to 8TB(4k block size), this is
>>practically not enough for the increasing need of bigger storage as
>>disks in a few years (or even now).
>>
>>To address this need, there are co-effort from RedHat, ClusterFS, IBM
>>and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
>>expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
>>ext3 is build on top of extent map changes for ext3, originally from
>>Alex Tomas. In short, the new ext3 on-disk extents format is:
>
>
> which implies matching changes to mkfs.ext2 and possibly mount..
>
>
Alexandre Ratchov and Laurent Vivier from BULL have done some work
in e2fsprogs to support extents and 48/64 bit ext3, although the patches
have not been thoroughly reviewed and discussed yet...
http://marc.theaimsgroup.com/?l=ext2-devel&m=114848122624510&w=2
>>Appreciate any comments and feedbacks!
>
>
> Somebody else was recently discussing a set of patches to ext3 for
> extents+delalloc+mballoc patches - is this work compatible with that?
>
Yes, the extents patch you mentioned is the same one included in the
series. The delalloc (delayed allocation for ext3) and mballoc
(multiple block allocation based on extents) patches are considered
future additions, as this series is intended to address the capability
issue and on-disk format only.
> Also, a pointer to the matching userspace patches would help anybody
> who's gung-ho enough to test the code....
>
Thanks for your interest!
We have tested patches 1-4 (which basically do not touch any on-disk
format) and they have been in the mm tree. The extent patch itself has
been tested for a long time by ClusterFS and IBM, as it was actually
posted a while back.
At this point the whole series compiles, but has not been tested yet.
This post, as an RFC, is intended to collect comments and feedback. The
BULL team has done some testing on the 2.6.16 version of the series with
the e2fsprogs changes they posted, though. I will upload the matching
e2fsprogs changes to ext2.sf.net/48bitsext3 shortly..
Mingming
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
2006-06-09 2:40 ` Valdis.Kletnieks
@ 2006-06-09 2:49 ` Jeff Garzik
2006-06-09 8:35 ` Andreas Dilger
2006-06-09 17:14 ` Alan Cox
2006-06-09 9:13 ` Christoph Hellwig
` (17 subsequent siblings)
19 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 2:49 UTC (permalink / raw)
To: cmm, Andrew Morton, Linus Torvalds
Cc: linux-kernel, ext2-devel, linux-fsdevel
Mingming Cao wrote:
> Current ext3 filesystem is limited to 8TB(4k block size), this is
> practically not enough for the increasing need of bigger storage as
> disks in a few years (or even now).
>
> To address this need, there are co-effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
> expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is build on top of extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:
One of my common complaints about massive ext3 updates such as this is
the ever-growing "which ext3 filesystem am I mounting?" problem.
I really think extents and 48bit-ness should imply
cp -a fs/ext3 fs/ext4
and go from there.
IMHO the ext3 back-compat situation is already really hairy, with all
the features added since the original ext3 release.
The alternative is continual bloating of ext3, and on filesystems,
inodes which are progressively upgraded -- meaning any use of a prior
kernel implies that you can only read a subset of your [meta]data, if
the back-compat code doesn't block the mount entirely.
People (including me) still switch back and forth between ext2 and ext3
mounts of the same filesystem on occasion. I think creating an "ext4"
would allow for greater developer flexibility in implementing new
features and ditching old ones -- while also emphasizing to the user
that switching back and forth between ext4 and ext[23] is a bad idea.
Overall, after applying extent (and 48bit) patches, I think it is wrong
to keep calling it ext3. That will break some existing user
assumptions, and continue to restrict developers' freedom to implement
nifty new features.
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 2:49 ` Jeff Garzik
@ 2006-06-09 8:35 ` Andreas Dilger
2006-06-09 15:08 ` Jeff Garzik
2006-06-09 17:14 ` Alan Cox
1 sibling, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 8:35 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel
On Jun 08, 2006 22:49 -0400, Jeff Garzik wrote:
> One of my common complaints about massive ext3 updates such as this is
> the ever-growing "which ext3 filesystem am I mounting?" problem.
>
> I really think extents and 48bit-ness should imply
> cp -a fs/ext3 fs/ext4
> and go from there.
The problem with this approach (as seen with ext2 and ext3) is that one
tree or the other gets stale w.r.t. bug fixes and now we have the case
where ext2 has a noticeably different implementation in some areas and
bug fixes are no longer trivial to apply to both trees.
I think all of the ext3 maintainers think this split was a bad idea in
hindsight, and having an ext3 mode where it can mount without a journal
would be much more desirable.
> IMHO the ext3 back-compat situation is already really hairy, with all
> the features added since the original ext3 release.
While partially true, ext2/ext3 has a very good history w.r.t. compatibility
(with one exception being the EAs on symlinks problem that slipped through
with selinux).
Yes, the extents format will be incompatible with older ext3, but it isn't
enabled by default so it will be completely up to the sysadmin when they
make their filesystem incompatible. They also won't impact any existing
files. The earlier extents support gets into a kernel.org kernel, the
more systems will be able to mount a filesystem with the changes when
they become widely used.
All of the other features that are going to be introduced will only
be applicable at format time (filesystems larger than 16TB), or when
exceeding limits of the current ext3 support (e.g. files larger than 2TB
in size).
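The size limits quoted in this thread (8TB and 16TB filesystems, 2TB files, 1024 PB with 48-bit block numbers) can be cross-checked with some rough arithmetic. This is only a sketch: it assumes the units are binary (TiB/PiB), that the 8TB figure corresponds to 31 usable bits of block number, and that the 2TB per-file limit comes from a 32-bit count of 512-byte units. None of these assumptions is stated in the thread itself.

```python
# Rough capacity arithmetic for the limits quoted in the thread.
# Assumption: "TB" and "PB" are used in the binary sense (TiB, PiB).
TIB = 2**40
PIB = 2**50

BLOCK = 4096   # 4k block size, as quoted in the thread
SECTOR = 512   # assumption: per-file block count is kept in 512-byte units


def max_bytes(address_bits: int, unit_bytes: int) -> int:
    """Largest size addressable with 2**address_bits units of unit_bytes each."""
    return (2**address_bits) * unit_bytes


fs_31bit = max_bytes(31, BLOCK)    # assumption: block numbers treated as signed 32-bit
fs_32bit = max_bytes(32, BLOCK)    # full unsigned 32-bit block numbers
fs_48bit = max_bytes(48, BLOCK)    # 48-bit block numbers
file_32bit = max_bytes(32, SECTOR) # 32-bit count of 512-byte units per file

print(fs_31bit // TIB, "TiB filesystem with 31-bit block numbers")
print(fs_32bit // TIB, "TiB filesystem with 32-bit block numbers")
print(fs_48bit // PIB, "PiB filesystem with 48-bit block numbers")
print(file_32bit // TIB, "TiB file with a 32-bit sector count")
```

Under those assumptions the numbers line up with the 8TB, 16TB, 1024 PB and 2TB figures used by the posters.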
> People (including me) still switch back and forth between ext2 and ext3
> mounts of the same filesystem on occasion. I think creating an "ext4"
> would allow for greater developer flexibility in implementing new
> features and ditching old ones -- while also emphasizing to the user
> that switching back and forth between ext4 and ext[23] is a bad idea.
While this is partly true, one of the big benefits is that you can
transparently upgrade your system to use the new features and improve
performance without a long outage window. Having a completely separate
ext4 filesystem doesn't improve the compatibility story at all. There
has been renewed discussion on implementing "mounting ext3 without a
journal", just for a recovery mode, because ext2 will not be modified
to get all of these features (running e2fsck on a huge filesystem each
reboot would be insane).
> Overall, after applying extent (and 48bit) patches, I think it is wrong
> to keep calling it ext3. That will break some existing user
> assumptions, and continue to restrict developers' freedom to implement
> nifty new features.
Just FYI, all of the ext3 developers are on board with this patch series
and it has been discussed and reviewed for many weeks already, it isn't
just being pushed by one party.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 8:35 ` Andreas Dilger
@ 2006-06-09 15:08 ` Jeff Garzik
2006-06-09 15:25 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:08 UTC (permalink / raw)
To: Andreas Dilger
Cc: cmm, Andrew Morton, Linus Torvalds, linux-kernel, ext2-devel,
linux-fsdevel
Please fix your mailer to stop creating bogus Mail-Followup-To headers,
headers which exclude the original poster, and cause compliant MUAs to
incorrectly build To/CC.
Andreas Dilger wrote:
> On Jun 08, 2006 22:49 -0400, Jeff Garzik wrote:
>> One of my common complaints about massive ext3 updates such as this is
>> the ever-growing "which ext3 filesystem am I mounting?" problem.
>>
>> I really think extents and 48bit-ness should imply
>> cp -a fs/ext3 fs/ext4
>> and go from there.
>
> The problem with this approach (as seen with ext2 and ext3) is that one
> tree or the other gets stale w.r.t. bug fixes and now we have the case
> where ext2 has a noticably different implementation in some areas and
> bug fixes are no longer trivial to apply to both trees.
>
> I think all of the ext3 maintainers think this split was a bad idea in
> hindsight, and having an ext3 mode where it can mount without a journal
> would be much more desirable.
Please look beyond just ext2/3. Other filesystems which have "version
1", "version 2", "version 3", ... formats are all nasty as hell. The
end-result bloated code essentially supports several filesystems, all
within the same code base, and it's a nightmare of ugliness.
Further, it's not only bloated, but slow. The code inevitably winds up
in one of two forms:
	if (spiffy new-feature metadata)
		...
	else if (updated metadata)
		...
	else /* original metadata */
		...
_or_ you add a level of indirection, by creating internal-to-the-fs
pointer operations.
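The two code shapes Jeff describes can be put side by side in a small sketch. Everything here is hypothetical: the version constants, the `read_*` helpers and the `INODE_OPS` table are made up for illustration and do not correspond to anything in fs/ext3. The first function branches on the metadata format at every call site; the second hides the choice behind a table of operations, which is the "level of indirection" form.

```python
from typing import Callable, Dict

# Hypothetical metadata format versions (not real ext3 identifiers).
V1, V2, V3 = 1, 2, 3


def read_v1(raw: dict) -> dict:
    return {"format": "original", "size": raw["size"]}


def read_v2(raw: dict) -> dict:
    return {"format": "updated", "size": raw["size"]}


def read_v3(raw: dict) -> dict:
    return {"format": "spiffy", "size": raw["size"]}


def read_inode_branching(version: int, raw: dict) -> dict:
    """Form 1: every caller-visible path tests the format explicitly."""
    if version == V3:
        return read_v3(raw)
    elif version == V2:
        return read_v2(raw)
    else:  # original metadata
        return read_v1(raw)


# Form 2: an internal operations table, looked up once per call.
INODE_OPS: Dict[int, Callable[[dict], dict]] = {
    V1: read_v1,
    V2: read_v2,
    V3: read_v3,
}


def read_inode_indirect(version: int, raw: dict) -> dict:
    return INODE_OPS[version](raw)


sample = {"size": 4096}
for v in (V1, V2, V3):
    print(v, read_inode_branching(v, sample), read_inode_indirect(v, sample))
```

Both forms return the same result for each known version; the difference is where the per-format knowledge lives, which is the maintenance cost being argued about.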
Stuffing more and more features into fs/ext3 means you are following the
path that leads to reiser4... where EVERYTHING under the hood is
mutable, all within fs/ext3.
>> IMHO the ext3 back-compat situation is already really hairy, with all
>> the features added since the original ext3 release.
>
> While partially true, ext2/ext3 has a very good history w.r.t. compatibility
> (with one exception being the EAs on symlinks problem that slipped through
> with selinux).
>
> Yes, the extents format will be incompatible with older ext3, but it isn't
> enabled by default so it will be completely up to the sysadmin when they
> make their filesystem incompatible. They also won't impact any existing
> files. The earlier extents support gets into a kernel.org kernel the
> more systems will be able to mount a filesystem with the changes when
> they becomes widely used.
>
> All of the other features that are going to be introduced will only going
> to be applicable for format time (filesystems larger than 16TB), or if
> exceeding limits of the current ext3 support (e.g. files larger than 2TB
> in size).
Yet more progressive incompatibility, yet more
	if (metadata v2)
		...
	else /* metadata v1 */
		...
Why do you insist upon calling the end result ext3, when the truth is
that you are slowly rewriting ext3?
As time progresses, more and more admins must ask themselves the
question "what flavor of ext3 filesystem is on my hard drive?"
Here's a key question for ext3 developers, which I bet has no answer:
when is it enough? Is the plan to continually introduce incompatible
features into ext3, over time, ad infinitum?
>> People (including me) still switch back and forth between ext2 and ext3
>> mounts of the same filesystem on occasion. I think creating an "ext4"
>> would allow for greater developer flexibility in implementing new
>> features and ditching old ones -- while also emphasizing to the user
>> that switching back and forth between ext4 and ext[23] is a bad idea.
>
> While this is partly true, one of the big benefits is that you can
> transparently upgrade your system to use the new features and improve
> performance without a long outage window. Having a completely separate
Changing the name to ext4 doesn't erase this capability.
> ext4 filesystem doesn't improve the compatibility story at all. There
> has been renewed discussion on implementing "mounting ext3 without a
> journal", just for a recovery mode, because ext2 will not be modified
> to get all of these features (running e2fsck on a huge filesystem each
> reboot would be insane).
So now you are going backwards, and implementing ext2-within-ext3?
Are you ready to admit, yet, that ext3 is 100% mutable in the minds of
ext3 developers? Why not implement the minix filesystem format within
ext3, at this point? We could call it a "plugin", I bet.
>> Overall, after applying extent (and 48bit) patches, I think it is wrong
>> to keep calling it ext3. That will break some existing user
>> assumptions, and continue to restrict developers' freedom to implement
>> nifty new features.
>
> Just FYI, all of the ext3 developers are on board with this patch series
> and it has been discussed and reviewed for many weeks already, it isn't
> just being pushed by one party.
That is completely irrelevant to this thread.
If all the ext3 developers are on board, that just implies that there is
no clear definition of what "ext3" really means. With this patch
series, and with future plans described here and elsewhere, the name
"ext3" will become more and more meaningless. It could mean _any_ of
several filesystem metadata variants, and the admin will have no clue
which variant they are talking to until they try to mount the blkdev
(and possibly fail the mount).
At SOME point, clueful developers will say "we should better concentrate
our energy on a new filesystem."
But I see no one at all defining that "some point."
At some point you are beating a dead horse. At some point, you are
pushing features into a filesystem that was never designed to support
said features.
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:08 ` Jeff Garzik
@ 2006-06-09 15:25 ` Jeff Garzik
2006-06-09 15:40 ` Linus Torvalds
2006-06-10 19:10 ` Kyle Moffett
2006-06-09 15:28 ` [Ext2-devel] " Alex Tomas
2006-06-09 20:32 ` Stephen C. Tweedie
2 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:25 UTC (permalink / raw)
To: linux-kernel, ext2-devel, linux-fsdevel
Cc: Andrew Morton, Linus Torvalds, cmm, Andreas Dilger
Overall, I'm surprised that ext3 developers don't see any of the
problems related to progressive, stealth filesystem upgrades.
Users are never given a clear indication of when their metadata is being
upgraded, there is no clear "line of demarcation" they cross, when they
start using extents.
Since there is no user-visible fs upgrade event, users do not have a
clear picture of what features are being used -- which means they are
kept in the dark about which kernels are OK to use on their data.
Do you guys honestly expect users to keep track of which kernels added
specific ext3 features?
This is why other enterprise filesystems have clear "fs version 1", "fs
version 2" points across which a user migrates. ext3's feature-flags
approach just means that there are a million combinations of potential
old-and-new features, in-tree and third party, all of which must be
supported.
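For readers unfamiliar with the feature-flags approach under discussion, here is a minimal sketch of how such a mount-time check can work, together with the combinatorics Jeff is worried about. It assumes the usual ext2/ext3 scheme of separate "incompat" and "ro_compat" bit sets in the superblock, where an unknown incompat bit blocks the mount and an unknown ro_compat bit only allows a read-only mount. The flag names and bit values below are illustrative assumptions and should be checked against the real headers before being relied on.

```python
# Illustrative flag bits (assumed values; verify against the ext2/ext3 headers).
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004
INCOMPAT_EXTENTS = 0x0040
RO_COMPAT_SPARSE_SUPER = 0x0001
RO_COMPAT_LARGE_FILE = 0x0002


def mount_decision(sb_incompat: int, sb_ro_compat: int,
                   known_incompat: int, known_ro_compat: int) -> str:
    """Decide how a kernel that knows only some features may mount a filesystem."""
    if sb_incompat & ~known_incompat:
        return "refuse"        # unknown incompatible feature: do not mount at all
    if sb_ro_compat & ~known_ro_compat:
        return "read-only"     # unknown ro-compat feature: reading is safe, writing is not
    return "read-write"


def feature_combinations(n_flags: int) -> int:
    """Number of distinct on/off combinations of n independent feature flags."""
    return 2 ** n_flags


# A hypothetical older kernel that predates extents.
old_incompat = INCOMPAT_FILETYPE | INCOMPAT_RECOVER
old_ro_compat = RO_COMPAT_SPARSE_SUPER | RO_COMPAT_LARGE_FILE
# A hypothetical newer kernel that also understands extents.
new_incompat = old_incompat | INCOMPAT_EXTENTS

fs_incompat = INCOMPAT_FILETYPE | INCOMPAT_EXTENTS
fs_ro_compat = RO_COMPAT_LARGE_FILE

print(mount_decision(fs_incompat, fs_ro_compat, old_incompat, old_ro_compat))
print(mount_decision(fs_incompat, fs_ro_compat, new_incompat, old_ro_compat))
print(feature_combinations(10))
```

The check itself is cheap; the concern raised in the thread is that the admin has to know which bits are set on disk before picking a kernel, and that the number of possible combinations doubles with every independent flag.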
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:25 ` Jeff Garzik
@ 2006-06-09 15:40 ` Linus Torvalds
2006-06-09 15:47 ` Jeff Garzik
` (2 more replies)
2006-06-10 19:10 ` Kyle Moffett
1 sibling, 3 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 15:40 UTC (permalink / raw)
To: Jeff Garzik
Cc: linux-kernel, ext2-devel, linux-fsdevel, Andreas Dilger, cmm,
Andrew Morton
On Fri, 9 Jun 2006, Jeff Garzik wrote:
>
> Overall, I'm surprised that ext3 developers don't see any of the problems
> related to progressive, stealth filesystem upgrades.
Hey, they're used to it - they've been doing it for a long time.
In fact, ext3 wouldn't be ext3 unless I (and perhaps a few others) had
insisted on it. People wanted to try to upgrade ext2 in place.
And they've been upgrading it in-place for a long time.
Now, there are unquestionably advantages to that approach too, but as you
say, there are absolutely tons of disadvantages too. Bugs get much much
subtler, and more disastrous for old users that don't even want the new
features.
Quite frankly, at this point, there's no way in hell I believe we can do
major surgery on ext3. It's the main filesystem for a lot of users, and
it's just not worth the instability worries unless it's something very
obviously transparent.
I wouldn't mind an ext4 (that hopefully drops some of the features of
ext3, and might not downgrade to ext2 on errors, for example).
Linus
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Linus Torvalds
@ 2006-06-09 15:47 ` Jeff Garzik
2006-06-09 15:55 ` Alex Tomas
` (2 more replies)
2006-06-09 15:57 ` Jeff Garzik
2006-06-09 16:10 ` [Ext2-devel] " Alex Tomas
2 siblings, 3 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, ext2-devel, linux-fsdevel, Andreas Dilger, cmm,
Andrew Morton
Linus Torvalds wrote:
>
> On Fri, 9 Jun 2006, Jeff Garzik wrote:
>> Overall, I'm surprised that ext3 developers don't see any of the problems
>> related to progressive, stealth filesystem upgrades.
>
> Hey, they're used to it - they've been doing it for a long time.
Agreed, but my argument is that extents are a Big Deal.
think about The Experience: Suddenly users that could use 2.4.x and
2.6.x are locked into 2.6.18+, by the simple and common act of writing
to a file.
No bells and whistles go off...
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:47 ` Jeff Garzik
@ 2006-06-09 15:55 ` Alex Tomas
2006-06-09 15:56 ` Jeff Garzik
2006-06-09 16:01 ` Linus Torvalds
2006-06-09 20:38 ` Stephen C. Tweedie
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 15:55 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> think about The Experience: Suddenly users that could use 2.4.x and
JG> 2.6.x are locked into 2.6.18+, by the simple and common act of writing
JG> to a file.
sorry to repeat, but if they simply try 2.6.18, they won't get extents.
instead, they must specify the extents mount option. and at this point
it must be made clear to them that this is a way to get an incompatible fs.
thanks, Alex
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:55 ` Alex Tomas
@ 2006-06-09 15:56 ` Jeff Garzik
2006-06-09 16:07 ` Alex Tomas
2006-06-09 20:52 ` Stephen C. Tweedie
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:56 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> think about The Experience: Suddenly users that could use 2.4.x and
> JG> 2.6.x are locked into 2.6.18+, by the simple and common act of writing
> JG> to a file.
>
> sorry to repeat, but if they simple try 2.6.18, they won't get extents.
> instead, they must specify extents mount option. and at this point
> they must get clear that this is a way to get incompatible fs.
Think about how this will be deployed in production, long term.
If extents are not made default at some point, then no one will use the
feature, and it should not be merged.
And when extents are default, you have this blizzard-of-feature-flags
stealth upgrade event occur _sometime_ after they boot into the new fs
for the first time. And then when they want to boot another kernel,
they have to dig down a feature matrix, and figure out which ext3
codebase will work for them.
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:56 ` Jeff Garzik
@ 2006-06-09 16:07 ` Alex Tomas
2006-06-09 16:09 ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:04 ` Matthew Frost
2006-06-09 20:52 ` Stephen C. Tweedie
1 sibling, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:07 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> Think about how this will be deployed in production, long term.
JG> If extents are not made default at some point, then no one will use
JG> the feature, and it should not be merged.
sorry, I disagree. for example, NUMA isn't default and shouldn't be.
but we have it in the tree and any one may choose to use it. the same
with extents. let's have it in. but let's make clear it's experimental,
it makes sense for large files only, it isn't backward compatible and
so on.
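Alex's remark that extents make sense mainly for large files can be illustrated with a generic sketch. This is not the on-disk format from the patch series (which is not reproduced in this part of the thread); it only shows the general idea of replacing a per-block map with (logical start, physical start, length) runs. The function name and tuple layout are hypothetical.

```python
from typing import List, Tuple

Extent = Tuple[int, int, int]  # (logical_start, physical_start, length)


def blocks_to_extents(blocks: List[int]) -> List[Extent]:
    """Collapse a per-block map (index = logical block) into contiguous runs."""
    extents: List[Extent] = []
    for logical, physical in enumerate(blocks):
        if extents:
            l0, p0, n = extents[-1]
            if physical == p0 + n:
                # Physically contiguous with the previous run: extend it.
                extents[-1] = (l0, p0, n + 1)
                continue
        extents.append((logical, physical, 1))
    return extents


# A large, contiguously allocated file collapses to a single run...
contiguous = list(range(1000, 1000 + 256))
print(len(contiguous), "block pointers ->", blocks_to_extents(contiguous))

# ...while a small, fragmented file gains nothing.
fragmented = [10, 50, 11, 51]
print(len(fragmented), "block pointers ->", blocks_to_extents(fragmented))
```

The saving grows with file size and contiguity, which is consistent with the claim that the benefit shows up for large files.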
thanks, Alex
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:07 ` Alex Tomas
@ 2006-06-09 16:09 ` Jeff Garzik
2006-06-09 18:04 ` Matthew Frost
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:09 UTC (permalink / raw)
To: Alex Tomas
Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Think about how this will be deployed in production, long term.
>
> JG> If extents are not made default at some point, then no one will use
> JG> the feature, and it should not be merged.
>
> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> but we have it in the tree and any one may choose to use it. the same
> with extents. let's have it in. but let's make clear it's experimental,
> it makes sense for large files only, it isn't backward compatible and
> so on.
NUMA _is_ on by default, in newer hardware kernels :) K8 is NUMA by
default, remember.
But anyway... the "it's experimental" argument is _completely_
irrelevant. You have to think about the day when it is not, and how
that will get deployed, and what are the potential problems that will
arise from deployment.
Jeff
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:07 ` Alex Tomas
2006-06-09 16:09 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:04 ` Matthew Frost
2006-06-09 18:10 ` Alex Tomas
2006-06-09 18:14 ` [Ext2-devel] " Andreas Dilger
1 sibling, 2 replies; 295+ messages in thread
From: Matthew Frost @ 2006-06-09 18:04 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Linus Torvalds, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Think about how this will be deployed in production, long term.
>
> JG> If extents are not made default at some point, then no one will use
> JG> the feature, and it should not be merged.
>
> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> but we have it in the tree and any one may choose to use it.
NUMA is designed to cope with a hardware feature, which not everybody
has. Filesystem upgrades are not qualitatively similar; it does not
depend on one's hardware design as to whether one uses ext3, let alone
extents. Your logic is faulty.
> the same
> with extents. let's have it in. but let's make clear it's experimental,
> it makes sense for large files only, it isn't backward compatible and
> so on.
>
> thanks, Alex
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:04 ` Matthew Frost
@ 2006-06-09 18:10 ` Alex Tomas
2006-06-09 18:14 ` [Ext2-devel] " Andreas Dilger
1 sibling, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 18:10 UTC (permalink / raw)
To: artusemrys
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Matthew Frost (MF) writes:
MF> Alex Tomas wrote:
>>>>>>> Jeff Garzik (JG) writes:
JG> Think about how this will be deployed in production, long term.
JG> If extents are not made default at some point, then no one will use
JG> the feature, and it should not be merged.
>> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
>> but we have it in the tree and any one may choose to use it.
MF> NUMA is designed to cope with a hardware feature, which not everybody
MF> has. Filesystem upgrades are not qualitatively similar; it does not
MF> depend on one's hardware design as to whether one uses ext3, let alone
MF> extents. Your logic is faulty.
proposed 48bit extents patch addresses 2TB limit.
thanks, Alex
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:04 ` Matthew Frost
2006-06-09 18:10 ` Alex Tomas
@ 2006-06-09 18:14 ` Andreas Dilger
2006-06-09 18:51 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:14 UTC (permalink / raw)
To: Matthew Frost
Cc: Alex Tomas, Jeff Garzik, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
On Jun 09, 2006 13:04 -0500, Matthew Frost wrote:
> Alex Tomas wrote:
> >sorry, I disagree. for example, NUMA isn't default and shouldn't be.
> >but we have it in the tree and any one may choose to use it.
>
> NUMA is designed to cope with a hardware feature, which not everybody
> has. Filesystem upgrades are not qualitatively similar; it does not
> depend on one's hardware design as to whether one uses ext3, let alone
> extents. Your logic is faulty.
If you have a > 8TB block device (which is common in large RAID devices
today, and will be a single disk in a couple of years) then it is important
that your filesystem work with this block device.
If ext2 and ext3 didn't support > 2GB files (which was a filesystem
feature added in exactly the same way as extents are today, and nobody
bitched about it then) then they would be relegated to the same status
as minix and xiafs and all the other filesystems that are stuck in the
"we can't change" or "we aren't supported" camps.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:14 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 18:51 ` Jeff Garzik
2006-06-09 19:39 ` Gerrit Huizenga
` (3 more replies)
0 siblings, 4 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:51 UTC (permalink / raw)
To: Matthew Frost, Alex Tomas, Jeff Garzik, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
Andreas Dilger wrote:
> On Jun 09, 2006 13:04 -0500, Matthew Frost wrote:
>> Alex Tomas wrote:
>>> sorry, I disagree. for example, NUMA isn't default and shouldn't be.
>>> but we have it in the tree and any one may choose to use it.
>> NUMA is designed to cope with a hardware feature, which not everybody
>> has. Filesystem upgrades are not qualitatively similar; it does not
>> depend on one's hardware design as to whether one uses ext3, let alone
>> extents. Your logic is faulty.
>
> If you have a > 8TB block device (which is common in large RAID devices
> today, will be a single disk in a couple of years) then it is important
> that your filesystem work with this block device.
>
> If ext2 and ext3 didn't support > 2GB files (which was a filesystem
> feature added in exactly the same way as extents are today, and nobody
> bitched about it then) then they would be relegated to the same status
> as minix and xiafs and all the other filesystems that are stuck in the
> "we can't change" or "we aren't supported" camps.
PRECISELY. So you should stop modifying a filesystem whose design is
admittedly _not_ modern!
ext3 is already essentially xiafs-on-life-support, when you consider
today's large storage systems and today's filesystem technology. Just
look at the ugly hacks needed to support expanding an ext3 filesystem
online.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:51 ` Jeff Garzik
@ 2006-06-09 19:39 ` Gerrit Huizenga
2006-06-09 19:45 ` [Ext2-devel] " Jeff Garzik
2006-06-10 10:03 ` Christoph Hellwig
2006-06-09 19:49 ` [Ext2-devel] " Theodore Tso
` (2 subsequent siblings)
3 siblings, 2 replies; 295+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 19:39 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
>
> PRECISELY. So you should stop modifying a filesystem whose design is
> admittedly _not_ modern!
So just how long do you think it would take to get a modern filesystem
into the hands of real users, supported by the distros? From community
building, through design, development, testing, delivery?
gerrit
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:39 ` Gerrit Huizenga
@ 2006-06-09 19:45 ` Jeff Garzik
2006-06-09 20:38 ` Gerrit Huizenga
2006-06-10 10:03 ` Christoph Hellwig
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:45 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
>> PRECISELY. So you should stop modifying a filesystem whose design is
>> admittedly _not_ modern!
>
> So just how long do you think it would take to get a modern filesystem
> into the hands of real users, supported by the distros? From community
> building, through design, development, testing, delivery?
Start from a known working point, and keep it working...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:45 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 20:38 ` Gerrit Huizenga
0 siblings, 0 replies; 295+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 20:38 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On Fri, 09 Jun 2006 15:45:16 EDT, Jeff Garzik wrote:
> Gerrit Huizenga wrote:
> > On Fri, 09 Jun 2006 14:51:55 EDT, Jeff Garzik wrote:
> >> PRECISELY. So you should stop modifying a filesystem whose design is
> >> admittedly _not_ modern!
> >
> > So just how long do you think it would take to get a modern filesystem
> > into the hands of real users, supported by the distros? From community
> > building, through design, development, testing, delivery?
>
> Start from a known working point, and keep it working...
Then clone all the user level packages, work with distros to get
the new packages included, update the man pages, get those included,
make sure bug fixes for ext2 get propagated to ext4 - oh, and those
for ext3 as well. And then work with mainline to decide when to
change from EXPERIMENTAL to stable, then decide how to get enough
users to make sure the testing is good enough, then work with the
distros to enable, then work with them to agree to provide support
to their most important, biggest, highest risk customers with this
new filesystem used by only 20 people because it isn't the default.
Then repeat this whole discussion with each new feature proposed for
ext4 over the next 5 years, watch developers get disillusioned yet
again, watch 4 new competing filesystems pop up and try to be the
next great filesystem. Watch them all fade away as the ultimate
battle for mindshare wears them out and the ever cascading war between
stability and support versus new features brings us back to where we
are again today.
Or just add the feature that the entire ext3 development community
thinks is stable enough to move forward, is well enough integrated
with the existing code to *not* be a bolt on, and is incrementally
small enough to be managed by its very own developer community
without the overhead of splitting that community even further.
The short words sound good but in reality we should all have lived
through this long enough to know better.
gerrit
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:39 ` Gerrit Huizenga
2006-06-09 19:45 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10 10:03 ` Christoph Hellwig
1 sibling, 0 replies; 295+ messages in thread
From: Christoph Hellwig @ 2006-06-10 10:03 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On Fri, Jun 09, 2006 at 12:39:19PM -0700, Gerrit Huizenga wrote:
> > PRECISELY. So you should stop modifying a filesystem whose design is
> > admittedly _not_ modern!
>
> So just how long do you think it would take to get a modern filesystem
> into the hands of real users, supported by the distros? From community
> building, through design, development, testing, delivery?
JFS is pretty nice because it has many advanced features but still is
rather simple. XFS has even more cool features such as a WIP parallel
fsck and is proven on the biggest filesystems on COS operating systems
out there, but as a disadvantage is hugely complex so outsiders have a
hard time getting into it.
So short-term the option I'd recommend is to start supporting XFS more broadly,
because it's the high end filesystem that's out there today and fills the
needs people have in the next five or so years.
For the time after that we need to think about something that can scale
as well and better while being simpler. Also we need to start thinking
about a clustered filesystem more, it might or might not make sense to
have a cluster filesystem also do the next generation local filesystem
thing. I'd probably start designing such a next gen fs by taking jfs
and revamping it completely.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:51 ` Jeff Garzik
2006-06-09 19:39 ` Gerrit Huizenga
@ 2006-06-09 19:49 ` Theodore Tso
2006-06-09 20:04 ` Jeff Garzik
2006-06-11 16:02 ` Arjan van de Ven
2006-06-12 22:06 ` [Ext2-devel] " Pavel Machek
3 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 19:49 UTC (permalink / raw)
To: Jeff Garzik
Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, Jun 09, 2006 at 02:51:55PM -0400, Jeff Garzik wrote:
> ext3 is already essentially xiafs-on-life-support, when you consider
> today's large storage systems and today's filesystem technology. Just
> look at the ugly hacks needed to support expanding an ext3 filesystem
> online.
And what ugly hacks are you talking about? It's actually quite clean;
with the latest e2fsprogs, you use the same command (resize2fs) for
doing both online and offline resizing.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:49 ` [Ext2-devel] " Theodore Tso
@ 2006-06-09 20:04 ` Jeff Garzik
2006-06-09 20:57 ` Stephen C. Tweedie
2006-06-09 22:37 ` [Ext2-devel] " Andreas Dilger
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:04 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Matthew Frost, Alex Tomas,
Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel
Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 02:51:55PM -0400, Jeff Garzik wrote:
>> ext3 is already essentially xiafs-on-life-support, when you consider
>> today's large storage systems and today's filesystem technology. Just
>> look at the ugly hacks needed to support expanding an ext3 filesystem
>> online.
>
> And what ugly hacks are you talking about? It's actually quite clean;
> with the latest e2fsprogs, you use the same command (resize2fs) for
> doing both online and offline resizing.
Consider a blkdev of size S1. Using LVM we increase that value under
the hood to size S2, where S2 > S1. We perform an online resize from
size S1 to S2. The size and alignment of any new groups added will
be different from the non-resize case, where mke2fs was run directly on a
blkdev of size S2.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:04 ` Jeff Garzik
@ 2006-06-09 20:57 ` Stephen C. Tweedie
2006-06-09 21:49 ` Jeff Garzik
2006-06-09 22:37 ` [Ext2-devel] " Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:57 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Theodore Ts'o, Matthew Frost, Stephen Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
Hi,
On Fri, 2006-06-09 at 16:04 -0400, Jeff Garzik wrote:
> Consider a blkdev of size S1. Using LVM we increase that value under
> the hood to size S2, where S2 > S1. We perform an online resize from
> size S1 to S2. The size and alignment of any new groups added will
> be different from the non-resize case, where mke2fs was run directly on a
> blkdev of size S2.
No, they won't. We simply grow the last block group in the filesystem
up to the size where we'd naturally add another block group anyway; and
then, we add another block group exactly where it would have been on a
fresh mkfs.
--Stephen
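The procedure Stephen describes can be sketched in a few lines. This is
a simplified model, not the resize2fs or kernel code: it tracks only
block-group boundaries (not where bitmaps and inode tables sit inside a
group), assumes 32768 blocks per group and a first data block of 0, and
ignores the mke2fs rule about dropping a too-small final group. The
function names are invented for illustration.

```python
# Simplified model of block-group boundaries; all names are hypothetical.

def group_layout(total_blocks, blocks_per_group=32768):
    """Return (start_block, length) for each group of a freshly made fs."""
    groups = []
    start = 0
    while start < total_blocks:
        groups.append((start, min(blocks_per_group, total_blocks - start)))
        start += blocks_per_group
    return groups


def grow(groups, new_total_blocks, blocks_per_group=32768):
    """Grow the last group to its natural size, then append new groups."""
    groups = list(groups)
    start, _ = groups.pop()  # the last, possibly short, group
    while start < new_total_blocks:
        groups.append((start, min(blocks_per_group, new_total_blocks - start)))
        start += blocks_per_group
    return groups


s1, s2 = 100_000, 250_000
print(grow(group_layout(s1), s2) == group_layout(s2))
```

In this model the grown filesystem and the freshly made one end up with
the same group boundaries, which is the point being made above.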
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:57 ` Stephen C. Tweedie
@ 2006-06-09 21:49 ` Jeff Garzik
2006-06-09 21:55 ` [Ext2-devel] " Stephen C. Tweedie
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:49 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-06-09 at 16:04 -0400, Jeff Garzik wrote:
>
>> Consider a blkdev of size S1. Using LVM we increase that value under
>> the hood to size S2, where S2 > S1. We perform an online resize from
>> size S1 to S2. The size and alignment of any new groups added will
>> be different from the non-resize case, where mke2fs was run directly on a
>> blkdev of size S2.
>
> No, they won't. We simply grow the last block group in the filesystem
> up to the size where we'd naturally add another block group anyway; and
> then, we add another block group exactly where it would have been on a
> fresh mkfs.
Yes but the inodes per group etc. would differ.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 21:49 ` Jeff Garzik
@ 2006-06-09 21:55 ` Stephen C. Tweedie
2006-06-09 23:44 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 21:55 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas, Stephen Tweedie
Hi,
On Fri, 2006-06-09 at 17:49 -0400, Jeff Garzik wrote:
> >> Consider a blkdev of size S1. Using LVM we increase that value under
> >> the hood to size S2, where S2 > S1. We perform an online resize from
> >> size S1 to S2. The size and alignment of any new groups added will
> >> be different from the non-resize case, where mke2fs was run directly on a
> >> blkdev of size S2.
> >
> > No, they won't. We simply grow the last block group in the filesystem
> > up to the size where we'd naturally add another block group anyway; and
> > then, we add another block group exactly where it would have been on a
> > fresh mkfs.
>
> Yes but the inodes per group etc. would differ.
No, we add the same number of inodes in the new groups that all the
previous groups have.
--Stephen
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:55 ` [Ext2-devel] " Stephen C. Tweedie
@ 2006-06-09 23:44 ` Jeff Garzik
2006-06-10 0:45 ` [Ext2-devel] " Andreas Dilger
2006-06-10 0:47 ` Theodore Tso
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:44 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Morton, Theodore Ts'o, Matthew Frost,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-06-09 at 17:49 -0400, Jeff Garzik wrote:
>
>>>> Consider a blkdev of size S1. Using LVM we increase that value under
>>>> the hood to size S2, where S2 > S1. We perform an online resize from
>>>> size S1 to S2. The size and alignment of any new groups added will
>>>> be different from the non-resize case, where mke2fs was run directly on a
>>>> blkdev of size S2.
>>> No, they won't. We simply grow the last block group in the filesystem
>>> up to the size where we'd naturally add another block group anyway; and
>>> then, we add another block group exactly where it would have been on a
>>> fresh mkfs.
>> Yes but the inodes per group etc. would differ.
>
> No, we add the same number of inodes in the new groups that all the
> previous groups have.
Yes. Re-read what I wrote. To put it another way, "mkfs S1 + resize to
S2" does not produce precisely the same layout as "mkfs S2".
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 23:44 ` Jeff Garzik
@ 2006-06-10 0:45 ` Andreas Dilger
2006-06-10 0:47 ` Theodore Tso
1 sibling, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 0:45 UTC (permalink / raw)
To: Jeff Garzik
Cc: Stephen C. Tweedie, Andrew Morton, Theodore Ts'o,
Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas
On Jun 09, 2006 19:44 -0400, Jeff Garzik wrote:
> Stephen C. Tweedie wrote:
> > No, we add the same number of inodes in the new groups that all the
> > previous groups have.
>
> Yes. Re-read what I wrote. To put it another way, "mkfs S1 + resize to
> S2" does not produce precisely the same layout as "mkfs S2".
And in what way is that important? I mean, really, if this is your argument
that ext3 online resizing is a "hack" then it is pretty weak. This does
not affect the operation or compatibility of the resized filesystem all the
way back to the stone age (i.e. every single ext2 kernel ever will work
with the resized filesystem). That is why online resizing (and the resize
inode) are a COMPAT feature.
If I "cp b a /mnt/newfs" and "cp a b /mnt/newfs" "a" and "b" will have
different inode numbers too, but doesn't mean that "cp" is a "hack".
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 23:44 ` Jeff Garzik
2006-06-10 0:45 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10 0:47 ` Theodore Tso
2006-06-10 1:09 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 0:47 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Fri, Jun 09, 2006 at 07:44:44PM -0400, Jeff Garzik wrote:
> Yes. Re-read what I wrote. To put it another way, "mkfs S1 + resize to
> S2" does not produce precisely the same layout as "mkfs S2".
Different in the same way that "mke2fs -E stride=5" results in a slightly
different location of where the block bitmaps, inode bitmaps, and
inode table might be, yes --- but SO WHAT?
There's a *reason* that the block group descriptors tell the kernel
where to find the block/inode bitmaps and the inode table. They can
change due to bad blocks in the filesystem, or requests to subtly
change the layout to optimize various RAID layouts, for example. And
exactly how the block/inode bitmaps would get laid out in response to
-E stride have also changed over time, depending on which version of
e2fsprogs, but ---- News flash!! --- it doesn't matter!!!
Jeff, you seem to think that the fact that the layout isn't precisely
the same after an on-line resizing is proof of something horrible, but
it isn't. The exact location of filesystem metadata has never been
fixed, not in the past ten years of ext2/3 history, and this is not a
big deal. It certainly isn't "proof" of on-line resizing being
something horrible, as you keep trying to claim, without any arguments
other than, "The layout is different!".
Oh my, hide the women and children...
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 0:47 ` Theodore Tso
@ 2006-06-10 1:09 ` Jeff Garzik
2006-06-10 1:30 ` [Ext2-devel] " Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 1:09 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Stephen C. Tweedie, Andrew Morton,
Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas
Theodore Tso wrote:
> Jeff, you seem to think that the fact that the layout isn't precisely
> the same after an on-line resizing is proof of something horrible, but
> it isn't. The exact location of filesystem metadata has never been
> fixed, not in the past ten years of ext2/3 history, and this is not a
> big deal. It certainly isn't "proof" of on-line resizing being
> something horrible, as you keep trying to claim, without any arguments
> other than, "The layout is different!".
No, I was proving merely that it is _different_. And the values where
you see a _difference_ are the ones of which are no longer sized
optimally, after you grow the fs to a larger size.
So you incur a performance penalty for resizing to size S2, rather than
mke2fs'ing the new blkdev at size S2. Certainly within the confines of
ext3 that cannot be helped, but a different inode allocation strategy
could improve upon that.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 1:09 ` Jeff Garzik
@ 2006-06-10 1:30 ` Andreas Dilger
2006-06-10 1:43 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 1:30 UTC (permalink / raw)
To: Jeff Garzik
Cc: Theodore Tso, Stephen C. Tweedie, Andrew Morton, Matthew Frost,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Jun 09, 2006 21:09 -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
> > Jeff, you seem to think that the fact that the layout isn't precisely
> > the same after an on-line resizing is proof of something horrible, but
> > it isn't. The exact location of filesystem metadata has never been
> > fixed, not in the past ten years of ext2/3 history, and this is not a
> > big deal. It certainly isn't "proof" of on-line resizing being
> > something horrible, as you keep trying to claim, without any arguments
> > other than, "The layout is different!".
>
> No, I was proving merely that it is _different_. And the values where
> you see a _difference_ are the ones of which are no longer sized
> optimally, after you grow the fs to a larger size.
It sounds like you don't know what you are talking about, which is OK,
except that you keep harping on some non-existent point.
> So you incur a performance penalty for resizing to size S2, rather than
> mke2fs'ing the new blkdev at size S2. Certainly within the confines of
> ext3 that cannot be helped, but a different inode allocation strategy
> could improve upon that.
??? Can you please be specific in what the performance penalty is, and
what specifically is "not sized optimally" after a resize? How exactly
does inode allocation strategy relate to anything at all to online resizing.
Given that Ted and I are both disagreeing with you, and we are the two
people who know the most about the online resizing code (SCT is also
in this same group), maybe you should just concede that you are incorrect
on this point and move on.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 1:30 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10 1:43 ` Jeff Garzik
2006-06-10 2:03 ` Theodore Tso
2006-06-10 2:26 ` Andreas Dilger
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 1:43 UTC (permalink / raw)
To: Jeff Garzik, Theodore Tso, Stephen C. Tweedie, Andrew Morton,
Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas
Andreas Dilger wrote:
> On Jun 09, 2006 21:09 -0400, Jeff Garzik wrote:
>> Theodore Tso wrote:
>>> Jeff, you seem to think that the fact that the layout isn't precisely
>>> the same after an on-line resizing is proof of something horrible, but
>>> it isn't. The exact location of filesystem metadata has never been
>>> fixed, not in the past ten years of ext2/3 history, and this is not a
>>> big deal. It certainly isn't "proof" of on-line resizing being
>>> something horrible, as you keep trying to claim, without any arguments
>>> other than, "The layout is different!".
>> No, I was proving merely that it is _different_. And the values where
>> you see a _difference_ are the ones of which are no longer sized
>> optimally, after you grow the fs to a larger size.
>
> It sounds like you don't know what you are talking about, which is OK,
> except that you keep harping on some non-existent point.
>
>> So you incur a performance penalty for resizing to size S2, rather than
>> mke2fs'ing the new blkdev at size S2. Certainly within the confines of
>> ext3 that cannot be helped, but a different inode allocation strategy
>> could improve upon that.
>
> ??? Can you please be specific in what the performance penalty is, and
> what specifically is "not sized optimally" after a resize? How exactly
> does inode allocation strategy relate to anything at all to online resizing.
Inodes per group / inode blocks per group, as I've already stated.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 1:43 ` Jeff Garzik
@ 2006-06-10 2:03 ` Theodore Tso
2006-06-10 2:11 ` [Ext2-devel] " Jeff Garzik
2006-06-10 2:58 ` [Ext2-devel] " Jeff Garzik
2006-06-10 2:26 ` Andreas Dilger
1 sibling, 2 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 2:03 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Fri, Jun 09, 2006 at 09:43:14PM -0400, Jeff Garzik wrote:
> >??? Can you please be specific in what the performance penalty is, and
> >what specifically is "not sized optimally" after a resize? How exactly
> >does inode allocation strategy relate to anything at all to online
> >resizing.
>
> Inodes per group / inode blocks per group, as I've already stated.
Nope! Inodes per group and inode blocks per group are maintained
across an online resize. So there is no difference in inodes per
group for a filesystem created at size S1 and resized to size S2
(using either an on-line or off-line resize), and a filesystem which
is created to be size S2.
As Andreas has said, "you don't know what you are talking about."
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 2:03 ` Theodore Tso
@ 2006-06-10 2:11 ` Jeff Garzik
2006-06-10 2:54 ` Theodore Tso
2006-06-10 2:58 ` [Ext2-devel] " Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 2:11 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Stephen C. Tweedie, Andrew Morton,
Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas
Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 09:43:14PM -0400, Jeff Garzik wrote:
>>> ??? Can you please be specific in what the performance penalty is, and
>>> what specifically is "not sized optimally" after a resize? How exactly
>>> does inode allocation strategy relate to anything at all to online
>>> resizing.
>> Inodes per group / inode blocks per group, as I've already stated.
>
> Inodes per group and inode blocks per group are maintained
> across an online resize.
That's the problem I'm pointing out.
> So there is no difference in inodes per
> group for a filesystem created at size S1 and resized to size S2
> (using either an on-line or off-line resize), and a filesystem which
> is created to be size S2.
Trivial to prove false, by your statement above if nothing else. But
anyway:
Run mke2fs on a blkdev of size 500MB, and one of 500GB. Note values.
Now resize blkdev formatted for size 500MB to 500GB, and note differences.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 2:11 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10 2:54 ` Theodore Tso
2006-06-10 3:11 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 2:54 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Fri, Jun 09, 2006 at 10:11:59PM -0400, Jeff Garzik wrote:
> Trivial to prove false, by your statement above if nothing else. But
> anyway:
> Run mke2fs on a blkdev of size 500MB, and one of 500GB. Note values.
> Now resize blkdev formatted for size 500MB to 500GB, and note differences.
OK, so *that's* what you were trying to get at. I wish you had said
that from the first, since most people who are creating filesystems to
resize (i.e., on LVM or RAID systems), don't start them as small as
500MB.
Yes, the default inode ratio and blocksize is different for
filesystems under 512MB. But that's largely irrelevant for the use
cases of online resizing, where people will generally be starting with
a filesystem *far* larger than 512megs. They might be starting with an
LVM sized to be 2 gigs and resize it to 5 gigs. Or 100 gigs and
resizing it to 200 gigs; or 500gigs; or a terabyte. In all of those
cases, the results are identical.
It also by the way has nothing to do with the "inode allocation
algorithm", as you claimed. The biggest difference will come from the
use of a 1k blocksize instead of 4k blocksize, but that's a matter of
the defaults that were selected for "small" filesystems. If someone
was creating a file system that they knew they were likely to resize
to 500GB, they could always create it with an explicitly specified
blocksize of 4k, and also specify a different inode ratio.
And this is your argument that on-line resizing is a horrible hack,
and ext3 should be thrown out and rewritten from scratch? That's
weak.
One other thought --- people do *care* about backwards compatibility
from a filesystem format level, and they do appreciate being able to
easily upgrade and take advantage of new filesystem features without
needing to do a dump/restore.
If you don't care about compatibility, but want a scalable filesystem,
take a look at JFS. It's very, very, good at what it does (and has
support for extents and large block numbers) --- and it's smaller than
XFS and doesn't have the VNODE and System V/IRIX API compatibility
crud of XFS. The only downside with it is that you do have to do a
backup, reformat, and restore, and of course, the lack of support from
pretty much all of the major distributions.
- Ted
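The 500MB versus 500GB comparison can be approximated numerically. The
sketch below is an illustration under stated assumptions, not mke2fs
itself: it assumes the "small" defaults are a 1 KiB blocksize with one
inode per 4096 bytes and the regular defaults are a 4 KiB blocksize with
one inode per 8192 bytes, takes one block bitmap block per group (so
8 * blocksize blocks per group), and ignores the rounding mke2fs applies
to inodes per group. The function names are hypothetical.

```python
import math

MIB, GIB = 1 << 20, 1 << 30


def fresh_mkfs(size_bytes, blocksize, inode_ratio):
    """Approximate geometry a fresh mkfs would pick (rounding ignored)."""
    blocks = size_bytes // blocksize
    blocks_per_group = 8 * blocksize  # one bitmap block tracks 8*blocksize blocks
    groups = math.ceil(blocks / blocks_per_group)
    inodes_per_group = (size_bytes // inode_ratio) // groups
    return {"blocksize": blocksize, "groups": groups,
            "inodes_per_group": inodes_per_group}


def grow(fs, new_size_bytes):
    """A resize keeps blocksize and inodes per group; only groups increase."""
    blocks = new_size_bytes // fs["blocksize"]
    groups = math.ceil(blocks / (8 * fs["blocksize"]))
    return {**fs, "groups": groups}


# Assumed defaults for a "small" filesystem versus a regular one.
small = fresh_mkfs(500 * MIB, blocksize=1024, inode_ratio=4096)
grown = grow(small, 500 * GIB)
fresh = fresh_mkfs(500 * GIB, blocksize=4096, inode_ratio=8192)

for name, fs in (("grown", grown), ("fresh", fresh)):
    total_inodes = fs["groups"] * fs["inodes_per_group"]
    print(name, fs, "bytes per inode ~", 500 * GIB // total_inodes)
```

Under these assumptions the grown filesystem keeps its 1 KiB blocks and
its denser inode tables, so the difference comes from the creation-time
defaults. Creating the small filesystem with an explicit 4k blocksize
and inode ratio, as suggested above, removes it in this model.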
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 2:54 ` Theodore Tso
@ 2006-06-10 3:11 ` Jeff Garzik
2006-06-10 12:15 ` Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 3:11 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
Theodore Tso wrote:
> And this is your argument that on-line resizing is a horrible hack,
It's an example of ext2 being bandaided to do something it was never
originally designed to do. If online resizing had been planned from the
start, allocating new inode tables on the fly would be trivial, as it is
in JFS/NTFS/...
> and ext3 should be thrown out and rewritten from scratch?
Blatant and silly exaggeration. Re-read the thread, and note how many
times "cp -a ext3 ext4" was written.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 3:11 ` Jeff Garzik
@ 2006-06-10 12:15 ` Theodore Tso
2006-06-10 14:31 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 12:15 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Fri, Jun 09, 2006 at 11:11:31PM -0400, Jeff Garzik wrote:
> It's an example of ext2 being bandaided to do something it was never
> originally designed to do. If online resizing had been planned from the
> start, allocating new inode tables on the fly would be trivial, as it is
> in JFS/NTFS/...
And once again this has *nothing* to do with inode allocation, or
dynamic allocation of inode tables. Your "performance issue" has to
do with a difference in blocksizes. If you want ext2/3 to pass your silly
test, then upgrade to the latest e2fsprogs and install the following
/etc/mke2fs.conf:
[defaults]
        base_features = sparse_super,filetype,resize_inode,dir_index
        blocksize = 4096
        inode_ratio = 8192

[fs_types]
        small = {
                blocksize = 4096
                inode_ratio = 8192
        }
        floppy = {
                blocksize = 4096
                inode_ratio = 8192
        }
Happy now?
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 12:15 ` Theodore Tso
@ 2006-06-10 14:31 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:31 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
Theodore Tso wrote:
> On Fri, Jun 09, 2006 at 11:11:31PM -0400, Jeff Garzik wrote:
>> It's an example of ext2 being bandaided to do something it was never
>> originally designed to do. If online resizing had been planned from the
>> start, allocating new inode tables on the fly would be trivial, as it is
>> in JFS/NTFS/...
>
> And once again this has *nothing* to do with inode allocation, or
> dynamic allocation of inode tables. Your "performance issue" has to
> do with a difference in blocksizes. If you want ext2/3 to pass your silly
> test, then upgrade to the latest e2fsprogs and install the following
> /etc/mke2fs.conf:
WTF? In none of my examples did block size ever change. In none of my
examples was block size ever mentioned as a factor.
Inode density was demonstrably different in the resize vs. mkfs cases.
And online resize -obviously- imposes a limit on inode density, by
locking inodes-per-group at fs creation time. Dynamic allocation of
inode tables would permit dynamic sizing of inode tables based on
current needs, rather than needs determined at fs creation time.
Jeff
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 2:03 ` Theodore Tso
2006-06-10 2:11 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10 2:58 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 2:58 UTC (permalink / raw)
To: Theodore Tso, Andreas Dilger, Stephen C. Tweedie
Cc: Andrew Morton, Matthew Frost, ext2-devel@lists.sourceforge.net,
linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
Alex Tomas
Theodore Tso wrote:
> Inodes per group and inode blocks per group are maintained
> across an online resize. So there is no difference in inodes per
> group for a filesystem created at size S1 and resized to size S2
> (using either an on-line or off-line resize), and a filesystem which
> is created to be size S2.
Here are real numbers, which illustrate how the above two statements
contradict, and how the second statement is false:
blkdev A, formatted with a 50MB filesystem
block size 4096
block count 12800 (size S1)
inodes per group 12800
blkdev A, formatted to full capacity (~350GB)
block size 4096
block count 95472256 (size S2)
inodes per group 32768
Case 1: online resize from 50MB to 350GB
Result: inodes per group == 12800 (it remains the same)
Case 2: mke2fs blkdev A, with no block-count restrictions
Result: inodes per group == 32768
Thus, each group holds fewer inodes in case #1 than in case #2.
Thus, case #2 has greater inode density than case #1.
Overall,
a) mke2fs chooses optimal values based on creation-time block count
b) online resize does not change these values
thus the values are no longer optimal. And in this case, they are never
-more- optimal, and potentially -less- optimal.
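To put rough numbers on the two cases above, here is a minimal shell sketch. It assumes the usual ext2/ext3 default of 8 x block size blocks per group (32768 for 4k blocks), which is not stated in this message, and the function name total_inodes is made up for illustration.

```shell
#!/bin/bash
# Sketch only: total inode count at size S2 for a given inodes-per-group.
# Assumes blocks_per_group = 8 * block_size (the usual ext2/ext3 default).
total_inodes() {
    local block_count=$1 block_size=$2 inodes_per_group=$3
    local blocks_per_group=$((8 * block_size))
    # Round up: a trailing partial group still carries a full inode table.
    local groups=$(( (block_count + blocks_per_group - 1) / blocks_per_group ))
    echo $((groups * inodes_per_group))
}

total_inodes 95472256 4096 12800   # case 1: resized from 50MB
total_inodes 95472256 4096 32768   # case 2: fresh mke2fs at full size
```

Under those assumptions the two calls print 37299200 and 95485952, so the resized filesystem ends up with roughly 39% of the inodes of the freshly made one.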
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 1:43 ` Jeff Garzik
2006-06-10 2:03 ` Theodore Tso
@ 2006-06-10 2:26 ` Andreas Dilger
2006-06-10 2:31 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 2:26 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Theodore Tso, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Jun 09, 2006 21:43 -0400, Jeff Garzik wrote:
> >??? Can you please be specific in what the performance penalty is, and
> >what specifically is "not sized optimally" after a resize? How exactly
> >does inode allocation strategy relate to anything at all to online
> >resizing.
>
> Inodes per group / inode blocks per group, as I've already stated.
As Stephen and Ted already replied (though I can understand if you missed
it; it seems this is a popular thread :-), the inode count per group
is a fixed parameter for the whole filesystem that even online resizing
cannot change.
The only things that can change on a per-group basis (with either online or
offline resizing, or with mke2fs -R stride=N, or if there are bad blocks
on disk) are the relative offsets within the group of the inode and
block bitmaps, and the relative location of the inode table
within the group. The size of the inode table per group (and
hence the number of inodes per group) is always constant, since it is stored
in the superblock and affects the inode number->group mapping.
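The inode number->group mapping mentioned above is plain integer arithmetic, which is why the per-group inode count cannot vary within one filesystem. A sketch, assuming the standard ext2 convention that inode numbers start at 1 (the function name is hypothetical):

```shell
#!/bin/bash
# Sketch only: locate an inode from its number, ext2-style.
# Inode numbers start at 1, so subtract 1 before dividing.
inode_location() {
    local ino=$1 inodes_per_group=$2
    local group=$(( (ino - 1) / inodes_per_group ))
    local index=$(( (ino - 1) % inodes_per_group ))
    echo "group=$group index=$index"
}

inode_location 12801 12800
inode_location 12801 32768
```

The same inode number lands in group 1 with 12800 inodes per group but in group 0 with 32768, so changing the value on an existing filesystem would effectively renumber every inode.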
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 2:26 ` Andreas Dilger
@ 2006-06-10 2:31 ` Jeff Garzik
2006-06-10 4:22 ` Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 2:31 UTC (permalink / raw)
To: Jeff Garzik, Theodore Tso, Stephen C. Tweedie, Andrew Morton,
Matthew Frost, ext2-devel@lists.sourceforge.net, linux-kernel,
Linus Torvalds, Mingming Cao, linux-fsdevel, Alex Tomas
Andreas Dilger wrote:
> the inode count per group
> is a fixed parameter for the whole filesystem that even online resizing
> cannot change.
Correct. Fixed... at mke2fs time. Thus, with varying mke2fs runs,
inodes-per-group can vary, where it does not with online resize.
Jeff
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 2:31 ` Jeff Garzik
@ 2006-06-10 4:22 ` Andreas Dilger
0 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 4:22 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Theodore Tso, Matthew Frost, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, Alex Tomas
On Jun 09, 2006 22:31 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >the inode count per group
> >is a fixed parameter for the whole filesystem that even online resizing
> >cannot change.
>
> Correct. Fixed... at mke2fs time. Thus, with varying mke2fs runs,
> inodes-per-group can vary, where it does not with online resize.
Unless specified differently at format time, the inodes-per-group will
be the same value (namely 16384) if the filesystem is larger than 512MB.
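One way to arrive at the 16384 figure, as a sketch: it assumes an inode_ratio of 8192 bytes per inode (the value in Ted's mke2fs.conf earlier in the thread) and 8 x block size blocks per group. Neither is stated in this message, and the function name is made up.

```shell
#!/bin/bash
# Sketch only: default inodes per group derived from bytes-per-inode.
inodes_per_group() {
    local block_size=$1 inode_ratio=$2
    local blocks_per_group=$((8 * block_size))   # assumed ext2/ext3 default
    local ipg=$((blocks_per_group * block_size / inode_ratio))
    # The inode bitmap is a single block, which caps inodes per group.
    local cap=$((8 * block_size))
    if (( ipg > cap )); then
        ipg=$cap
    fi
    echo "$ipg"
}

inodes_per_group 4096 8192
```

With 4k blocks and one inode per 8192 bytes this prints 16384. A ratio of 4096 gives 32768, which is also the bitmap-imposed ceiling for 4k blocks.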
So, yes, I agree with you if you start with a tiny filesystem and try
to resize it to a gigantic filesystem you will get a different number
of inodes, but that is true whether this is online resizing or offline.
That said, I think anyone who has resized their filesystem would rather
be able to resize it than not be able to do so at all.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:04 ` Jeff Garzik
2006-06-09 20:57 ` Stephen C. Tweedie
@ 2006-06-09 22:37 ` Andreas Dilger
1 sibling, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 22:37 UTC (permalink / raw)
To: Jeff Garzik
Cc: Theodore Tso, Matthew Frost, Alex Tomas, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On Jun 09, 2006 16:04 -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
> > And what ugly hacks are you talking about? It's actually quite clean;
> > with the latest e2fsprogs, you use the same command (resize2fs) for
> > doing both online and offline resizing.
>
> Consider a blkdev of size S1. Using LVM we increase that value under
> the hood to size S2, where S2 > S1. We perform an online resize from
> size S1 to S2. The size and alignment of any new groups added will
> be different from the non-resize case, where mke2fs was run directly on a
> blkdev of size S2.
Umm, and how is that a problem? Either you want online resizing because
it provides some useful functionality, or you don't want it because you
are concerned with something that nobody else in the world is. In the
latter case, don't use it. Even if the metadata alignment is slightly
different on disk, that doesn't make it in any way an invalid filesystem.
In fact, a filesystem that was resized online stays 100% compatible with
kernels all the way back to the dark ages of Linux.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:51 ` Jeff Garzik
2006-06-09 19:39 ` Gerrit Huizenga
2006-06-09 19:49 ` [Ext2-devel] " Theodore Tso
@ 2006-06-11 16:02 ` Arjan van de Ven
2006-06-11 16:30 ` Nikita Danilov
2006-06-12 6:35 ` Andreas Dilger
2006-06-12 22:06 ` [Ext2-devel] " Pavel Machek
3 siblings, 2 replies; 295+ messages in thread
From: Arjan van de Ven @ 2006-06-11 16:02 UTC (permalink / raw)
To: Jeff Garzik
Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> PRECISELY. So you should stop modifying a filesystem whose design is
> admittedly _not_ modern!
>
> ext3 is already essentially xiafs-on-life-support, when you consider
> today's large storage systems and today's filesystem technology. Just
> look at the ugly hacks needed to support expanding an ext3 filesystem
> online.
actually I think I disagree with you. One thing I've noticed over the
years is that ext2 layout has one thing going for it: it is simple and
robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
direct/indirect block based" may be better. It seems that once you go
into tree space (and I would call htree a borderline thing there) you
get both really complex code and fragile behavior all over (mostly in
terms of "when something goes wrong")
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-11 16:02 ` Arjan van de Ven
@ 2006-06-11 16:30 ` Nikita Danilov
2006-06-11 16:55 ` [Ext2-devel] " Arjan van de Ven
2006-06-12 6:35 ` Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Nikita Danilov @ 2006-06-11 16:30 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andrew Morton, Matthew Frost, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
Arjan van de Ven writes:
> On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> > PRECISELY. So you should stop modifying a filesystem whose design is
> > admittedly _not_ modern!
> >
> > ext3 is already essentially xiafs-on-life-support, when you consider
> > today's large storage systems and today's filesystem technology. Just
> > look at the ugly hacks needed to support expanding an ext3 filesystem
> > online.
>
>
> actually I think I disagree with you. One thing I've noticed over the
> years is that ext2 layout has one thing going for it: it is simple and
> robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
> direct/indirect block based" may be better. It seems that once you go
> into tree space (and I would call htree a borderline thing there) you
> get both really complex code and fragile behavior all over (mostly in
> terms of "when something goes wrong")
Huh? Direct/indirect/double-indirect/... _is_ a tree, albeit not a
balanced one. What makes s5fs/ffs/ufs/ext* so exceptionally robust is
the fixed position of inode tables, which provides a guaranteed starting
point for fsck under almost any circumstances.
Nikita.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-11 16:30 ` Nikita Danilov
@ 2006-06-11 16:55 ` Arjan van de Ven
0 siblings, 0 replies; 295+ messages in thread
From: Arjan van de Ven @ 2006-06-11 16:55 UTC (permalink / raw)
To: Nikita Danilov
Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
On Sun, 2006-06-11 at 20:30 +0400, Nikita Danilov wrote:
> Arjan van de Ven writes:
> > On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> > > PRECISELY. So you should stop modifying a filesystem whose design is
> > > admittedly _not_ modern!
> > >
> > > ext3 is already essentially xiafs-on-life-support, when you consider
> > > today's large storage systems and today's filesystem technology. Just
> > > look at the ugly hacks needed to support expanding an ext3 filesystem
> > > online.
> >
> >
> > actually I think I disagree with you. One thing I've noticed over the
> > years is that ext2 layout has one thing going for it: it is simple and
> > robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
> > direct/indirect block based" may be better. It seems that once you go
> > into tree space (and I would call htree a borderline thing there) you
> > get both really complex code and fragile behavior all over (mostly in
> > terms of "when something goes wrong")
>
> Huh? Direct/indirect/double-indirect/... _is_ a tree, albeit not a
> balanced one.
ok sure; the main strength is that it is not a dynamic tree.
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-11 16:02 ` Arjan van de Ven
2006-06-11 16:30 ` Nikita Danilov
@ 2006-06-12 6:35 ` Andreas Dilger
1 sibling, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-12 6:35 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On Jun 11, 2006 18:00 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 21:44 +0100, Alan Cox wrote:
> > OTOH the number of complaints about this is minimal, people want to go
> > forwards in a controlled manner not backwards.
>
> well... they want to be able to go "a little bit" backwards; say one
> version of an OS (6 months). Eg the scenario that ought to work is "go
> to newer version, hate it, go back". But yes that's a limited time to go
> back, not the "go back to 2.2" kind of "go back".
Interestingly, one of the reasons we want(ed) to get the extents code into
the ext3 mainline ASAP is that this would allow it to be available for the
"go back" phase when (in a couple of years) you NEED to have support for
gigantic block devices and have no choice but to use this code to update.
For today it would only be used by people who really want to use it.
On Jun 11, 2006 18:02 +0200, Arjan van de Ven wrote:
> On Fri, 2006-06-09 at 14:51 -0400, Jeff Garzik wrote:
> > PRECISELY. So you should stop modifying a filesystem whose design is
> > admittedly _not_ modern!
> >
> > ext3 is already essentially xiafs-on-life-support, when you consider
> > today's large storage systems and today's filesystem technology. Just
> > look at the ugly hacks needed to support expanding an ext3 filesystem
> > online.
>
> actually I think I disagree with you. One thing I've noticed over the
> years is that ext2 layout has one thing going for it: it is simple and
> robust. Maybe "ext2 layout" is the wrong word, "block bitmap and
> direct/indirect block based" may be better. It seems that once you go
> into tree space (and I would call htree a borderline thing there) you
> get both really complex code and fragile behavior all over (mostly in
> terms of "when something goes wrong")
You're correct in calling htree a borderline case, because the directory
metadata is still accessible in a "linear" manner if the tree is corrupted
for some reason. I've recently been thinking of making the structure even
more robust by encoding a singly- or doubly-linked list into the directory
leaf blocks.
However, the direct/indirect block tree is the most fragile part of
ext2/ext3. It also has the bad effect that corruption in the file indirect
tree can easily amplify into widespread filesystem corruption, because wrongly
freeing an indirect block and reallocating it will potentially cause 1024 more
blocks to be freed when that indirect block is unlinked, etc. This is also
the slowest part of e2fsck checking if it detects corruption (duplication)
in the block allocation.
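The figure of 1024 follows from 4-byte block numbers packed into a 4k indirect block. A quick sketch of how far a single bad pointer can reach at each level of indirection (the function name is hypothetical):

```shell
#!/bin/bash
# Sketch only: data blocks reachable through one indirect block
# at a given depth (1 = single, 2 = double, 3 = triple indirect).
blocks_reachable() {
    local block_size=$1 depth=$2
    local ptrs=$((block_size / 4))   # 32-bit block numbers
    local total=1 i
    for ((i = 0; i < depth; i++)); do
        total=$((total * ptrs))
    done
    echo "$total"
}

blocks_reachable 4096 1
blocks_reachable 4096 2
```

For 4k blocks this prints 1024 and 1048576, so a wrongly reused double-indirect block can drag in about a million blocks.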
When we had very small filesystems it was easy to tell if an
indirect block was corrupt, because the valid block numbers made up only
a small fraction of the 2^32 possible block numbers. However, with large
filesystems valid block numbers make up a large fraction of the 2^32 block
number space. As we get to 16TB filesystems it is impossible to tell when
an indirect block is filled with garbage and when it is valid.
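The shrinking usefulness of a range check can be made concrete. This sketch estimates what percentage of all 32-bit values would pass as in-range block numbers for a given filesystem size; the function name and the 4k block size are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: percentage of all 2^32 values that fall inside the
# filesystem, i.e. that a range check alone cannot reject.
valid_block_percent() {
    local fs_bytes=$1 block_size=$2
    local fs_blocks=$((fs_bytes / block_size))
    echo $((fs_blocks * 100 / (1 << 32)))
}

valid_block_percent $((1 << 30)) 4096    # 1GB filesystem
valid_block_percent $((1 << 43)) 4096    # 8TB filesystem
valid_block_percent $((1 << 44)) 4096    # 16TB filesystem
```

With integer arithmetic this prints 0 (well under one percent), 50 and 100: at 16TB every possible 32-bit value is a plausible block number.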
One of the features of the extent format is that, firstly, it has a magic
number in each "indirect" block (called an extent index block). Secondly,
there is enough redundancy that it allows internal validation of the extent
data (e.g. that extents have sequentially increasing logical offsets, and that
the parent's logical offset correctly "encompasses" all of the leaf's
logical offsets).
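As an illustration of the kind of internal validation described above (not the actual ext3 extents code), the sketch below checks that a leaf's extents have strictly increasing, non-overlapping logical offsets and that none of them starts before the parent index's logical offset. The start:length encoding and the function name are made up for the example.

```shell
#!/bin/bash
# Sketch only. Usage: check_leaf PARENT_START START:LEN [START:LEN ...]
# Returns 0 if the leaf looks consistent, 1 otherwise.
check_leaf() {
    local parent_start=$1; shift
    local next_free=$parent_start ext start len
    for ext in "$@"; do
        start=${ext%%:*}
        len=${ext##*:}
        # Each extent must be non-empty and begin at or after the end of
        # the previous one (and at or after the parent's logical offset).
        if (( len <= 0 || start < next_free )); then
            return 1
        fi
        next_free=$((start + len))
    done
    return 0
}

check_leaf 0 0:8 8:4 100:16 && echo "leaf ok"
check_leaf 0 0:8 4:4 || echo "overlap detected"
check_leaf 50 10:4 || echo "extent precedes parent offset"
```

A block of random garbage is very unlikely to pass this kind of ordering check, which is the redundancy a plain array of block pointers lacks.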
Finally, one of the features that has been designed into the extent format
(though not yet implemented) is that it is possible to add a checksum to
each extent index to verify the metadata more strongly. There will also
be space to have a back-pointer to the parent inode for validation.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:51 ` Jeff Garzik
` (2 preceding siblings ...)
2006-06-11 16:02 ` Arjan van de Ven
@ 2006-06-12 22:06 ` Pavel Machek
2006-06-14 14:31 ` Barry K. Nathan
3 siblings, 1 reply; 295+ messages in thread
From: Pavel Machek @ 2006-06-12 22:06 UTC (permalink / raw)
To: Jeff Garzik
Cc: Matthew Frost, Alex Tomas, Linus Torvalds, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
Hi!
> >If ext2 and ext3 didn't support > 2GB files (which was a filesystem
> >feature added in exactly the same way as extents are today, and nobody
> >bitched about it then) then they would be relegated to the same status
> >as minix and xiafs and all the other filesystems that are stuck in the
> >"we can't change" or "we aren't supported" camps.
>
> PRECISELY. So you should stop modifying a filesystem
> whose design is admittedly _not_ modern!
>
> ext3 is already essentially xiafs-on-life-support, when
> you consider today's large storage systems and today's
> filesystem technology.
Please don't. AFAIK, ext2/3 is the only filesystem with a working fsck
(because that fsck was actually needed in the old days). Starting from
xfs/jfs/reiser/??? means we no longer have a working fsck...
--
Thanks for all the (sleeping) penguins.
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-12 22:06 ` [Ext2-devel] " Pavel Machek
@ 2006-06-14 14:31 ` Barry K. Nathan
2006-06-14 21:34 ` [Ext2-devel] " Pavel Machek
0 siblings, 1 reply; 295+ messages in thread
From: Barry K. Nathan @ 2006-06-14 14:31 UTC (permalink / raw)
To: Pavel Machek
Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On 6/12/06, Pavel Machek <pavel@ucw.cz> wrote:
> Please don't. AFAIK, ext2/3 is the only filesystem with a working fsck
> (because that fsck was actually needed in the old days). Starting from
> xfs/jfs/reiser/??? means we no longer have a working fsck...
Er, what do you mean by "working fsck"?
Unless I'm misunderstanding something, JFS also has a working fsck
(which has actually performed successful repair of real-world
filesystem corruption for me, although I haven't used it as much as
e2fsck or xfs_repair).
XFS's fsck is a no-op, but I think it could be implemented as a
wrapper around xfs_repair (and maybe xfs_check). xfs_repair has
successfully fixed corrupted filesystems for me, just as JFS's fsck
has.
(As for ReiserFS... well, in the past it's probably been too easy to
shoot yourself in the foot with reiserfsck and make the filesystem
worse-to-nonexistent instead of better. I haven't needed to use
reiserfsck on a corrupt FS lately so I don't know how it compares
these days.)
--
-Barry K. Nathan <barryn@pobox.com>
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-14 14:31 ` Barry K. Nathan
@ 2006-06-14 21:34 ` Pavel Machek
2006-06-15 0:28 ` Barry K. Nathan
0 siblings, 1 reply; 295+ messages in thread
From: Pavel Machek @ 2006-06-14 21:34 UTC (permalink / raw)
To: Barry K. Nathan
Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
Hi!
> >Please don't. AFAIK, ext2/3 is the only filesystem with a working fsck
> >(because that fsck was actually needed in the old days). Starting from
> >xfs/jfs/reiser/??? means we no longer have a working fsck...
>
> Er, what do you mean by "working fsck"?
Passes 8 hours of me trying to intentionally break it with weird,
artificial disk corruption.
I even have a script somewhere.
> Unless I'm misunderstanding something, JFS also has a working fsck
> (which has actually performed successful repair of real-world
> filesystem corruption for me, although I haven't used it as much as
> e2fsck or xfs_repair).
...like, if it repaired 100 different, non-trivial corruptions, that
would be an argument.
fsck.ext2 survives my torture (in some versions). fsck.vfat never
worked for me (likes to segfault), fsck.reiser never worked for me.
Pavel
--
Thanks for all the (sleeping) penguins.
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-14 21:34 ` [Ext2-devel] " Pavel Machek
@ 2006-06-15 0:28 ` Barry K. Nathan
2006-06-15 4:55 ` Theodore Tso
2006-06-15 9:15 ` Pavel Machek
0 siblings, 2 replies; 295+ messages in thread
From: Barry K. Nathan @ 2006-06-15 0:28 UTC (permalink / raw)
To: Pavel Machek
Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On 6/14/06, Pavel Machek <pavel@ucw.cz> wrote:
> Passes 8 hours of me trying to intentionally break it with weird,
> artificial disk corruption.
>
> I even have a script somewhere.
Ok, thanks for clarifying.
> > Unless I'm misunderstanding something, JFS also has a working fsck
> > (which has actually performed successful repair of real-world
> > filesystem corruption for me, although I haven't used it as much as
> > e2fsck or xfs_repair).
>
> ...like, if it repaired 100 different, non-trivial corruptions, that
> would be an argument.
In the case of XFS, I've repaired maybe two dozen (or so) corruptions
that might be non-trivial (in most of the cases, the filesystem
wouldn't even mount before the repair).
> fsck.ext2 survives my torture (in some versions). fsck.vfat never
> worked for me (likes to segfault), fsck.reiser never worked for me.
BTW, I actually have a test filesystem here (an e2image from an actual
filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
segfault. Strangely, more ancient versions (like what ships in Red Hat
7.2) were able to repair it without segfaulting. In a few days, once
other stuff calms down for me, I need to revisit that and see if the
bug still exists with 1.39.
--
-Barry K. Nathan <barryn@pobox.com>
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-15 0:28 ` Barry K. Nathan
@ 2006-06-15 4:55 ` Theodore Tso
2006-06-15 7:43 ` Barry K. Nathan
2006-06-15 9:15 ` Pavel Machek
1 sibling, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-15 4:55 UTC (permalink / raw)
To: Barry K. Nathan; +Cc: ext2-devel, linux-kernel, linux-fsdevel
On Wed, Jun 14, 2006 at 05:28:31PM -0700, Barry K. Nathan wrote:
> BTW, I actually have a test filesystem here (an e2image from an actual
> filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
> segfault. Strangely, more ancient versions (like what ships in Red Hat
> 7.2) were able to repair it without segfaulting. In a few days, once
> other stuff calms down for me, I need to revisit that and see if the
> bug still exists with 1.39.
Please try it with 1.39; if it still crashes, let me know --- I treat
any filesystem corruption that causes e2fsck to crash, or which e2fsck
can't fix in a single pass, as a bug. I'm guessing, though, that this
was probably the following bug, which was fixed right after 1.38 was
released (some distributions did have the fix, but it's in the mainline
e2fsprogs starting with 1.39):
2005-07-04 Theodore Ts'o <tytso@mit.edu>
* pass2.c (e2fsck_process_bad_inode): Fixed bug which could cause
e2fsck to core dump if a disconnected inode contained an
extended attribute. This was actually caused by two bugs.
The first bug is that if the inode has been fully fixed
up, the code will attempt to remove the inode from the
inode_bad_map without checking to see if this bitmap is
present. Since it is cleared at the end of pass 2, if
e2fsck_process_bad_inode is called in pass 4 (as it is for
disconnected inodes), this would result in a core dump.
This bug was mostly hidden by a second bug, which caused
e2fsck_process_bad_inode() to consider all inodes without
an extended attribute to be not fixed. (Addresses Debian
Bug: #316736)
- Ted
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-15 0:28 ` Barry K. Nathan
2006-06-15 4:55 ` Theodore Tso
@ 2006-06-15 9:15 ` Pavel Machek
2006-06-15 9:40 ` Barry K. Nathan
1 sibling, 1 reply; 295+ messages in thread
From: Pavel Machek @ 2006-06-15 9:15 UTC (permalink / raw)
To: Barry K. Nathan
Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
Hi!
> >Passes 8 hours of me trying to intentionally break it with weird,
> >artificial disk corruption.
> >
> >I even have a script somewhere.
>
> Ok, thanks for clarifying.
You can get a copy, it would be interesting to know how JFS/XFS does.
> >> Unless I'm misunderstanding something, JFS also has a working fsck
> >> (which has actually performed successful repair of real-world
> >> filesystem corruption for me, although I haven't used it as much as
> >> e2fsck or xfs_repair).
> >
> >...like, if it repaired 100 different, non-trivial corruptions, that
> >would be an argument.
>
> In the case of XFS, I've repaired maybe two dozen (or so) corruptions
> that might be non-trivial (in most of the cases, the filesystem
> wouldn't even mount before the repair).
>
> >fsck.ext2 survives my torture (in some versions). fsck.vfat never
> >worked for me (likes to segfault), fsck.reiser never worked for me.
>
> BTW, I actually have a test filesystem here (an e2image from an actual
> filesystem I encountered once) that used to cause e2fsck 1.36/1.37 to
> segfault. Strangely, more ancient versions (like what ships in Red Hat
> 7.2) were able to repair it without segfaulting. In a few days, once
> other stuff calms down for me, I need to revisit that and see if the
> bug still exists with 1.39.
It varies a bit between versions, but at least e2fsck has a regression
test suite... I had a nasty ext2 corruption in the past (suspend wrote 0
onto a strategic place in the bitmaps) where it put the filesystem in
self-destruct mode. e2fsck reported fixing the corruption, but did not
really fix it... e2fsck has been fixed in the meantime.
(I also have a way to corrupt ext2 that basically can't be repaired
automatically. Deallocating the free block bitmap and putting data in
the freed space is an evil way to corrupt a filesystem.)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-15 9:15 ` Pavel Machek
@ 2006-06-15 9:40 ` Barry K. Nathan
2006-06-15 9:50 ` [Ext2-devel] " Pavel Machek
0 siblings, 1 reply; 295+ messages in thread
From: Barry K. Nathan @ 2006-06-15 9:40 UTC (permalink / raw)
To: Pavel Machek
Cc: Andrew Morton, Matthew Frost, Jeff Garzik, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas
On 6/15/06, Pavel Machek <pavel@suse.cz> wrote:
> Hi!
>
> > >Passes 8 hours of me trying to intentionally break it with weird,
> > >artificial disk corruption.
> > >
> > >I even have a script somewhere.
> >
> > Ok, thanks for clarifying.
>
> You can get a copy, it would be interesting to know how JFS/XFS does.
Ok, I would be interested in getting a copy. (Maybe it would be good
to post it in public so that other people can try it too.)
--
-Barry K. Nathan <barryn@pobox.com>
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-15 9:40 ` Barry K. Nathan
@ 2006-06-15 9:50 ` Pavel Machek
0 siblings, 0 replies; 295+ messages in thread
From: Pavel Machek @ 2006-06-15 9:50 UTC (permalink / raw)
To: Barry K. Nathan
Cc: Jeff Garzik, Matthew Frost, Alex Tomas, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
Hi!
> >> >Passes 8 hours of me trying to intentionally break it with weird,
> >> >artificial disk corruption.
> >> >
> >> >I even have a script somewhere.
> >>
> >> Ok, thanks for clarifying.
> >
> >You can get a copy, it would be interesting to know how JFS/XFS does.
>
> Ok, I would be interested in getting a copy. (Maybe it would be good
> to post it in public so that other people can try it too.)
It needs some hand-tuning to do maximum damage to the filesystem while
keeping the filesystem "recognizable". It also depends on fsck returning
reasonable error codes...
Pavel
#!/bin/bash
#
# fscktest
#
# Usage:
#   Make sure output is logged somewhere
#   First, run fscktest -p as root
#   Then you can run fscktest as normal user...
#
# Build a known-good filesystem image in ./fsck.okay (mount needs root).
prepare() {
    SIZE=100000     # image size in KiB
    echo "Creating file..."
    head -c $((SIZE * 1024)) /dev/zero > test
    echo "Making filesystem..."
    mkfs.$FS test
    echo "Mounting..."
    # exit takes a numeric status, so print the message separately
    mount test -o loop /mnt || { echo "Can't mount" >&2; exit 1; }
    echo "Copying files..."
    cp -a /bin /mnt
    cp -a /usr/bin /mnt
    cp -a /usr/src/linux /mnt
    echo "Syncing..."
    sync
    echo "Unmounting..."
    umount /mnt
    echo "Moving..."
    mv test fsck.okay
    echo "All done."
}
FS=ext2
if [ "$1" = "-p" ]; then
    prepare
    exit
fi
RUN=0
while true; do
    RUN=$((RUN + 1))
    echo "Run #$RUN"
    echo Preparing...
    cat fsck.okay > fsck.damaged
    echo Damaging...
    # Overwrite 10240 512-byte sectors with random data, starting at sector 7
    dd if=/dev/urandom of=fsck.damaged count=10240 seek=7 conv=notrunc
    # Keep a copy of the damaged image so a failure can be reproduced
    cp fsck.damaged fsck.test
    echo First check...
    fsck.$FS -fy fsck.damaged
    RESULT=$?
    # 0 = no errors, 1 = errors corrected, 2 = corrected, reboot suggested
    if [ $RESULT != 1 -a $RESULT != 2 -a $RESULT != 0 ]; then
        echo "Fsck failed in bad way (result = $RESULT)"
        exit
    fi
    echo Second check...
    # A second pass over a repaired image must come back clean
    fsck.$FS -fy fsck.damaged
    RESULT=$?
    if [ $RESULT != 0 ]; then
        echo "Fsck lied about its success (result = $RESULT)"
        exit
    fi
done
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:56 ` Jeff Garzik
2006-06-09 16:07 ` Alex Tomas
@ 2006-06-09 20:52 ` Stephen C. Tweedie
2006-06-09 21:47 ` [Ext2-devel] " Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:52 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
Alex Tomas, Andreas Dilger
Hi,
On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:
> Think about how this will be deployed in production, long term.
>
> If extents are not made default at some point, then no one will use the
> feature, and it should not be merged.
Features such as ACLs and SELinux are still not on by default and are
most *definitely* used. This is a bogus argument.
> And when extents are default, you have this blizzard-of-feature-flags
> stealth upgrade event occur _sometime_ after they boot into the new fs
> for the first time.
No. I don't see it ever being forced on in the kernel by default, so
there will be no such "stealth upgrades".
Rather, if it is "made default", that will be done by setting the flag
by default on newly-created filesystems in mke2fs. We won't be playing
magic on existing filesystems.
And to avoid confusion, I am *entirely* open to the idea of making it
only ever default to on in mke2fs at some point in the future where we
batch a set of incompat features with the "ext4" label, so that "mke2fs
-O ext4", or "mke4fs", would set it. That has already been proposed on
ext2-devel; we're nowhere near the stage of making that default yet.
--Stephen
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:52 ` Stephen C. Tweedie
@ 2006-06-09 21:47 ` Jeff Garzik
2006-06-10 0:41 ` James Morris
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:47 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Alex Tomas, Andrew Morton, ext2-devel@lists.sourceforge.net,
linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
Andreas Dilger
Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:
>
>> Think about how this will be deployed in production, long term.
>>
>> If extents are not made default at some point, then no one will use the
>> feature, and it should not be merged.
>
> Features such as ACLs and SELinux are still not on by default and are
> most *definitely* used. This is a bogus argument.
They are on in SELinux-enabled distro installs, AFAIK?
>> And when extents are default, you have this blizzard-of-feature-flags
>> stealth upgrade event occur _sometime_ after they boot into the new fs
>> for the first time.
>
> No. I don't see it ever being forced on in the kernel by default, so
> there will be no such "stealth upgrades".
>
> Rather, if it is "made default", that will be done by setting the flag
> by default on newly-created filesystems in mke2fs. We won't be playing
> magic on existing filesystems.
>
> And to avoid confusion, I am *entirely* open to the idea of making it
> only ever default to on in mke2fs at some point in the future where we
> batch a set of incompat features with the "ext4" label, so that "mke2fs
> -O ext4", or "mke4fs", would set it. That has already been proposed on
> ext2-devel; we're nowhere near the stage of making that default yet.
Sure. And why not bundle that with a vehicle for separating out the
_code_ that deals with ancient formats versus newer formats. A vehicle
that enables the existing ext3 stuff to stabilize and freeze, while
enabling parallel development of new features.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:47 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-10 0:41 ` James Morris
0 siblings, 0 replies; 295+ messages in thread
From: James Morris @ 2006-06-10 0:41 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, 9 Jun 2006, Jeff Garzik wrote:
> Stephen C. Tweedie wrote:
> > Hi,
> >
> > On Fri, 2006-06-09 at 11:56 -0400, Jeff Garzik wrote:
> >
> > > Think about how this will be deployed in production, long term.
> > >
> > > If extents are not made default at some point, then no one will use the
> > > feature, and it should not be merged.
> >
> > Features such as ACLs and SELinux are still not on by default and are
> > most *definitely* used. This is a bogus argument.
>
> They are on in SELinux-enabled distro installs, AFAIK?
In RHEL & FC, SELinux xattrs are enabled by default, and acls need to be
enabled via a mount option.
--
James Morris
<jmorris@namei.org>
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:47 ` Jeff Garzik
2006-06-09 15:55 ` Alex Tomas
@ 2006-06-09 16:01 ` Linus Torvalds
2006-06-09 20:38 ` Stephen C. Tweedie
2 siblings, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:01 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
Andreas Dilger
On Fri, 9 Jun 2006, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> >
> > On Fri, 9 Jun 2006, Jeff Garzik wrote:
> > > Overall, I'm surprised that ext3 developers don't see any of the problems
> > > related to progressive, stealth filesystem upgrades.
> >
> > Hey, they're used to it - they've been doing it for a long time.
>
> Agreed, but my argument is that extents are a Big Deal.
I'm not arguing against you - I'm arguing with you.
I just tried to explain what you saw as "surprising" - the fact that ext3
developers don't see this as a problem at all. They don't see it as a
problem, because it's how they have always worked, since before ext3 was
ext3, and it was just a crazy extension to ext2.
And yes, it's a serious problem. Ext3 is pretty damn messy. It's not as
messy as some, but it sure has potential.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:47 ` Jeff Garzik
2006-06-09 15:55 ` Alex Tomas
2006-06-09 16:01 ` Linus Torvalds
@ 2006-06-09 20:38 ` Stephen C. Tweedie
2 siblings, 0 replies; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:38 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
Andreas Dilger
Hi,
On Fri, 2006-06-09 at 11:47 -0400, Jeff Garzik wrote:
> think about The Experience: Suddenly users that could use 2.4.x and
> 2.6.x are locked into 2.6.18+, by the simple and common act of writing
> to a file.
No.
The default is --- user writes to file on 2.6.18+, goes back to 2.4, and
everything still keeps on working just fine.
Or, user says "I *specifically* request this feature that I *know* is
not compatible with older kernels", and then they get just that.
Extents are not going to be on by default. Please, we've got more sense
than that!
Just like the developer who says "I *specifically* code for this fancy
new vmsplice syscall" gets exactly the same.
--Stephen
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Linus Torvalds
2006-06-09 15:47 ` Jeff Garzik
@ 2006-06-09 15:57 ` Jeff Garzik
2006-06-09 16:10 ` [Ext2-devel] " Alex Tomas
2 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
Andreas Dilger
Linus Torvalds wrote:
> Quite frankly, at this point, there's no way in hell I believe we can do
> major surgery on ext3. It's the main filesystem for a lot of users, and
> it's just not worth the instability worries unless it's something very
> obviously transparent.
>
> I wouldn't mind an ext4 (that hopefully drops some of the features of
> ext3, and might not downgrade to ext2 on errors, for example).
Certainly agreed, for all of this :)
I think that the lack of ext4 means people keep trying to stuff the
wrong things into ext3.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Linus Torvalds
2006-06-09 15:47 ` Jeff Garzik
2006-06-09 15:57 ` Jeff Garzik
@ 2006-06-09 16:10 ` Alex Tomas
2006-06-09 16:10 ` Jeff Garzik
2006-06-09 16:25 ` Linus Torvalds
2 siblings, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
>>>>> Linus Torvalds (LT) writes:
LT> Quite frankly, at this point, there's no way in hell I believe we can do
LT> major surgery on ext3. It's the main filesystem for a lot of users, and
LT> it's just not worth the instability worries unless it's something very
LT> obviously transparent.
I believe it's as stable as before until you mount with extents
mount option.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:10 ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 16:10 ` Jeff Garzik
2006-06-09 16:24 ` Erik Mouw
` (2 more replies)
2006-06-09 16:25 ` Linus Torvalds
1 sibling, 3 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:10 UTC (permalink / raw)
To: Alex Tomas
Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Linus Torvalds (LT) writes:
>
>
> LT> Quite frankly, at this point, there's no way in hell I believe we can do
> LT> major surgery on ext3. It's the main filesystem for a lot of users, and
> LT> it's just not worth the instability worries unless it's something very
> LT> obviously transparent.
>
> I believe it's as stable as before until you mount with extents
> mount option.
If it will remain a mount option, if it is never made the default
(either in kernel or distro level), then only 1% of users will ever use
the feature. And we shouldn't merge a 1% use feature into the _main_
filesystem for Linux.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:10 ` Jeff Garzik
@ 2006-06-09 16:24 ` Erik Mouw
2006-06-09 16:28 ` Jeff Garzik
2006-06-09 16:24 ` [Ext2-devel] " Chase Venters
2006-06-09 16:25 ` Alex Tomas
2 siblings, 1 reply; 295+ messages in thread
From: Erik Mouw @ 2006-06-09 16:24 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 12:10:59PM -0400, Jeff Garzik wrote:
> Alex Tomas wrote:
> > I believe it's as stable as before until you mount with extents
> > mount option.
>
> If it will remain a mount option, if it is never made the default
> (either in kernel or distro level), then only 1% of users will ever use
> the feature. And we shouldn't merge a 1% use feature into the _main_
> filesystem for Linux.
Why not? That's how htree dir indexing got in, and AFAIK most distros
use it as a default.
Erik
--
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:24 ` Erik Mouw
@ 2006-06-09 16:28 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:28 UTC (permalink / raw)
To: Erik Mouw
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Erik Mouw wrote:
> On Fri, Jun 09, 2006 at 12:10:59PM -0400, Jeff Garzik wrote:
>> Alex Tomas wrote:
>>> I believe it's as stable as before until you mount with extents
>>> mount option.
>> If it will remain a mount option, if it is never made the default
>> (either in kernel or distro level), then only 1% of users will ever use
>> the feature. And we shouldn't merge a 1% use feature into the _main_
>> filesystem for Linux.
>
> Why not? That's how htree dir indexing got in, and AFAIK most distros
> use it as a default.
The question is not today's usage, but long term production usage. If
it is destined to be default eventually, then it's not a 1% case.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:10 ` Jeff Garzik
2006-06-09 16:24 ` Erik Mouw
@ 2006-06-09 16:24 ` Chase Venters
2006-06-09 16:25 ` Alex Tomas
2 siblings, 0 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 16:24 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Jeff Garzik wrote:
> Alex Tomas wrote:
>> > > > > > Linus Torvalds (LT) writes:
>>
>>
>> LT> Quite frankly, at this point, there's no way in hell I believe we can
>> LT> do major surgery on ext3. It's the main filesystem for a lot of users,
>> LT> and it's just not worth the instability worries unless it's something
>> LT> very obviously transparent.
>>
>> I believe it's as stable as before until you mount with extents
>> mount option.
>
> If it will remain a mount option, if it is never made the default (either in
> kernel or distro level), then only 1% of users will ever use the feature.
> And we shouldn't merge a 1% use feature into the _main_ filesystem for Linux.
Pardon me because I haven't made it all the way through this discussion
yet, so I don't know if this has been suggested or dismissed. But I'm
curious - rather than 'stealth upgrade' by way of mount options, why not
just enable the functionality either via tune2fs or mkfs.ext3?
New distribution versions could ship installers that enable it, because users
aren't really going to switch from a new distribution they just installed to
an older version (same story on the kernel).
Users that want the functionality today can have it by asking for it with
tune2fs, they just have to bypass the warning that tells them they're not
going to be able to boot kernels before 2.6.xx
> Jeff
Cheers,
Chase
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:10 ` Jeff Garzik
2006-06-09 16:24 ` Erik Mouw
2006-06-09 16:24 ` [Ext2-devel] " Chase Venters
@ 2006-06-09 16:25 ` Alex Tomas
2006-06-09 16:28 ` Jeff Garzik
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:25 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> If it will remain a mount option, if it is never made the default
JG> (either in kernel or distro level), then only 1% of users will ever
JG> use the feature. And we shouldn't merge a 1% use feature into the
JG> _main_ filesystem for Linux.
strictly speaking, not that many users really need >2TB fs ...
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:25 ` Alex Tomas
@ 2006-06-09 16:28 ` Jeff Garzik
2006-06-09 16:50 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:28 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> If it will remain a mount option, if it is never made the default
> JG> (either in kernel or distro level), then only 1% of users will ever
> JG> use the feature. And we shouldn't merge a 1% use feature into the
> JG> _main_ filesystem for Linux.
>
> strictly speaking, not that many users really need >2TB fs ...
Not true. Terabyte SATA drives are less than a year away. 2TB
drives... probably 2 years?
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:28 ` Jeff Garzik
@ 2006-06-09 16:50 ` Alex Tomas
2006-06-09 16:53 ` [Ext2-devel] " Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:50 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> Alex Tomas wrote:
>>>>>>> Jeff Garzik (JG) writes:
JG> If it will remain a mount option, if it is never made the default
JG> (either in kernel or distro level), then only 1% of users will ever
JG> use the feature. And we shouldn't merge a 1% use feature into the
JG> _main_ filesystem for Linux.
>> strictly speaking, not that many users really need >2TB fs ...
JG> Not true. Terabyte SATA drives are less than a year away. 2TB
JG> drives... probably 2 years?
oh, 2 years sound long enough for defaulting extents?
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:50 ` Alex Tomas
@ 2006-06-09 16:53 ` Jeff Garzik
2006-06-09 17:01 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:53 UTC (permalink / raw)
To: Alex Tomas
Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Alex Tomas wrote:
> >>>>>>> Jeff Garzik (JG) writes:
> JG> If it will remain a mount option, if it is never made the default
> JG> (either in kernel or distro level), then only 1% of users will ever
> JG> use the feature. And we shouldn't merge a 1% use feature into the
> JG> _main_ filesystem for Linux.
> >> strictly speaking, not that many users really need >2TB fs ...
>
> JG> Not true. Terabyte SATA drives are less than a year away. 2TB
> JG> drives... probably 2 years?
>
> oh, 2 years sound long enough for defaulting extents?
If terabyte drives will be here in less than a year, and 750GB drives
are already here, then people with today's commodity hardware are
probably already chomping at the bit to do >2TB LVM and RAID.
Hook eight 750GB SATA drives to a Marvell SATA controller (all
commodity, all production) and you're way past 2TB.
Jeff
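
[Worked arithmetic for the example above, added as an aside.] The sketch assumes raw capacity with no RAID parity overhead, drive sizes in decimal gigabytes as marketed, and the ">2TB" figure read as 2 TiB; the variable names are illustrative only.

```shell
drives=8
drive_gb=750                                        # decimal GB per drive

# Raw capacity of the array in bytes.
total_bytes=$(( drives * drive_gb * 1000 * 1000 * 1000 ))

# The 2TB boundary under discussion, taken here as 2 TiB.
limit_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))

echo "array: $total_bytes bytes"
echo "limit: $limit_bytes bytes"

if [ "$total_bytes" -gt "$limit_bytes" ]; then
    echo "array exceeds the limit"
fi
```

Eight such drives give 6,000,000,000,000 bytes against a 2,199,023,255,552-byte limit, so even an array that gives up one drive to parity stays well above it.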
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:53 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 17:01 ` Alex Tomas
2006-06-09 17:10 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 17:01 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
that's why we're trying to get it in *now*. because we need it.
and nobody AFAIK insists on making extents the default or such.
thanks, Alex
>>>>> Jeff Garzik (JG) writes:
JG> If terabyte drives will be here in less than a year, and 750GB drives
JG> are already here, then people with today's commodity hardware are
JG> probably already chomping at the bit to do >2TB LVM and RAID.
JG> Hook eight 750GB SATA drives to a Marvell SATA controller (all
JG> commodity, all production) and you're way past 2TB.
JG> Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:10 ` [Ext2-devel] " Alex Tomas
2006-06-09 16:10 ` Jeff Garzik
@ 2006-06-09 16:25 ` Linus Torvalds
2006-06-09 16:48 ` Alex Tomas
` (3 more replies)
1 sibling, 4 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:25 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Alex Tomas wrote:
>
> I believe it's as stable as before until you mount with extents
> mount option.
That's always a possibility in theory, and almost never in practice.
Btw, I don't care about extents _per_se_. I do care about the fact that
people seem to think that code gets better as it supports more features.
Not so.
The whole logic of "code sharing is good" is a huge mistake. Shared code
is not at all better than individual code snippets, and often much much
worse. In particular, if the shared code has separate code-paths, it's not
just twice as complicated: it's _more_ than twice as bad, since it introduces
the conditionals _and_ it introduces the very real risk of the conditional
being taken the wrong way by mistake.
In contrast, the last time two different filesystems introduced bugs in
each other was approximately "never". They simply don't modify each others
code, they don't look at each others data structures, and they don't jump
into each others routines.
So two separate filesystems are _less_ to maintain than one big one. Even
if there's a lot of code that -could- be shared.
And no, extents in themselves aren't necessarily "the thing" that drives
it from maintainable to unmaintainable. This crap grows over time. But I
would _seriously_ suggest that starting anew with a "new" filesystem, and
taking the time to actually also get _rid_ of some of the baggage would
quite likely be a good idea.
Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
take up way too much space in memory. It has absolutely disgusting code to
handle directory reading and writing (buffer heads! In 2006!). Its
conditional indexing code is horrible. Its performance absolutely sucks
when the journal is being drained or something.
Are you going to improve on any of those _fundamental_ problems? Or are
you going to make them worse?
Hint: I'm betting you're not going to improve them by adding more
features.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:25 ` Linus Torvalds
@ 2006-06-09 16:48 ` Alex Tomas
2006-06-09 16:54 ` KELEMEN Peter
2006-06-09 16:55 ` Jeff Garzik
2006-06-09 16:54 ` [Ext2-devel] " Linus Torvalds
` (2 subsequent siblings)
3 siblings, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
so, instead of taking one (quite-well-tested) part that solves one of
the biggest ext3 limitations, you propose to start a new project and
get something in a year (probably) ?
I think about extents as a step-by-step way ...
thanks, Alex
>>>>> Linus Torvalds (LT) writes:
LT> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
LT> take up way too much space in memory. It has absolutely disgusting code to
LT> handle directory reading and writing (buffer heads! In 2006!). Its
LT> conditional indexing code is horrible. Its performance absolutely sucks
LT> when the journal is being drained or something.
LT> Are you going to improve on any of those _fundamental_ problems? Or are
LT> you going to make them worse?
LT> Hint: I'm betting you're not going to improve them by adding more
LT> features.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:48 ` Alex Tomas
@ 2006-06-09 16:54 ` KELEMEN Peter
2006-06-09 16:55 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: KELEMEN Peter @ 2006-06-09 16:54 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
* Alex Tomas (alex@clusterfs.com) [20060609 20:48]:
> I think about extents as a step-by-step way ...
...so call it ext4 *now* and have a complete rewrite of the whole
codebase as ext5. Users get what they want now (ext4) and Linus
gets what he wants later (ext5). Extents are useful for Joe
Average User with <2 TB filesystems as well.
It's already funny enough that I'm using e2* tools for managing
ext3 filesystems...
Peter
--
.+'''+. .+'''+. .+'''+. .+'''+. .+''
Kelemen Péter / \ / \ Peter.Kelemen@cern.ch
.+' `+...+' `+...+' `+...+' `+...+'
_______________________________________________
Ext2-devel mailing list
Ext2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ext2-devel
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:48 ` Alex Tomas
2006-06-09 16:54 ` KELEMEN Peter
@ 2006-06-09 16:55 ` Jeff Garzik
2006-06-09 17:12 ` [Ext2-devel] " Alex Tomas
` (2 more replies)
1 sibling, 3 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:55 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
> so, instead of taking one (quite-well-tested) part that solves one of
> the biggest ext3 limitations, you propose to start a new project and
> get something in a year (probably) ?
>
> I think about extents as a step-by-step way ...
That is what the entirety of Linux development is -- step-by-step.
It is OBVIOUS that it would take five minutes to start ext4.
1) clone a new tree
2) cp -a fs/ext3 fs/ext4
3) apply extent and 48bit patches
4) apply related e2fsprogs patches
Then update ext4 step-by-step, using the normal Linux development process.
Jeff
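
[Sketch of steps 1 and 2 above, added as an aside.] It runs against a scratch directory rather than a real kernel checkout, so the tree layout and the placeholder file are assumptions made for illustration. Steps 3 and 4 are left out because the thread does not name the patch files.

```shell
# Scratch directory standing in for a freshly cloned kernel tree (step 1).
tree=$(mktemp -d)
mkdir -p "$tree/fs/ext3"
echo '/* placeholder for ext3 source */' > "$tree/fs/ext3/super.c"

# Step 2: copy the directory, preserving modes and timestamps.
cp -a "$tree/fs/ext3" "$tree/fs/ext4"

# The copy starts out identical: diff -r prints nothing and exits 0.
diff -r "$tree/fs/ext3" "$tree/fs/ext4" && echo "fs/ext4 starts as an exact copy"
```

A real fork would presumably also need its own build entries and renamed symbols before both filesystems could be compiled into one kernel; that is beyond this sketch.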
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:55 ` Jeff Garzik
@ 2006-06-09 17:12 ` Alex Tomas
2006-06-09 17:12 ` Jeff Garzik
2006-06-09 19:57 ` Theodore Tso
2006-06-10 0:07 ` Olivier Galibert
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 17:12 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Linus Torvalds, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> That is what the entirety of Linux development is -- step-by-step.
JG> It is OBVIOUS that it would take five minutes to start ext4.
right. it's not a problem to *start*. it's a problem to maintain.
day by day fs/ext3 and fs/ext4 will get more and more diffs.
at some point it will be a headache to apply patches from ext3
to ext4 and back. I know this very well ....
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:12 ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 17:12 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:12 UTC (permalink / raw)
To: Alex Tomas
Cc: Linus Torvalds, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> That is what the entirety of Linux development is -- step-by-step.
>
> JG> It is OBVIOUS that it would take five minutes to start ext4.
>
> right. it's not a problem to *start*. it's a problem to maintain.
> day by day fs/ext3 and fs/ext4 will get more and more diffs.
> at some point it will be a headache to apply patches from ext3
> to ext4 and back. I know this very well ....
As Linus has stated, we have empirical evidence that splitting
filesystems works, for both stability and development speed.
The number of patches to ext[23] will trickle off over time. As the
obvious example, ext4 would receive the extent and 48bit patches rather
than ext3 :)
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:55 ` Jeff Garzik
2006-06-09 17:12 ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 19:57 ` Theodore Tso
2006-06-09 20:09 ` Jeff Garzik
` (2 more replies)
2006-06-10 0:07 ` Olivier Galibert
2 siblings, 3 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 19:57 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
> That is what the entirety of Linux development is -- step-by-step.
>
> It is OBVIOUS that it would take five minutes to start ext4.
>
> 1) clone a new tree
> 2) cp -a fs/ext3 fs/ext4
> 3) apply extent and 48bit patches
> 4) apply related e2fsprogs patches
>
> Then update ext4 step-by-step, using the normal Linux development process.
We don't do this with the SCSI layer where we make a complete clone of
the driver layer so that there is a /usr/src/linux/drivers/scsi and
/usr/src/linux/drivers/scsi2, do we? And we didn't do that with the
networking layer either, as we added ipsec, ipv6, softnet, and a whole
host of other changes and improvements.
What we do instead is we have a series of patches, which can be made
available in various experimental trees, and as they get more
polishing and experience with people using it without any problems,
they can get merged into the -mm tree, and then eventually, when they
are deemed ready, into mainline. That is also the normal Linux
development process, and it's worked quite well up until now with ext3.
Folks seem to be worried about ext3 being "too important to experiment
with", but the fact remains, we've been doing continuous improvement
with ext3 for quite some time, and it's been quite smooth. The htree
introduction was essentially completely painless, for example --- and
people liked the fact that they could get the features of indexed
directories without needing to do a complete dump and restore of the
filesystem.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:57 ` Theodore Tso
@ 2006-06-09 20:09 ` Jeff Garzik
2006-06-09 20:14 ` Alex Tomas
2006-06-09 20:38 ` Joel Becker
2006-06-12 8:58 ` Jes Sorensen
2 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:09 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Theodore Tso wrote:
> We don't do this with the SCSI layer where we make a complete clone of
> the driver layer so that there is a /usr/src/linux/drivers/scsi and
> /usr/src/linux/drivers/scsi2, do we? And we didn't do that with the
> networking layer either, as we added ipsec, ipv6, softnet, and a whole
> host of other changes and improvements.
>
> What we do instead is we have a series of patches, which can be made
> available in various experimental trees, and as they get more
> polishing and experience with people using it without any problems,
> they can get merged into the -mm tree, and then eventually, when they
> are deemed ready, into mainline. That is also the normal Linux
> development process, and it's worked quite well up until now with ext3.
No, there is a key difference between ext3 and SCSI/etc.: cruft is removed.
In ext3, old formats are supported for all eternity.
> Folks seem to be worried about ext3 being "too important to experiment
> with", but the fact remains, we've been doing continuous improvement
> with ext3 for quite some time, and it's been quite smooth. The htree
> introduction was essentially completely painless, for example --- and
I disagree. There were some distro annoyances as I recall.
> people liked the fact that they could get the features of indexed
> directories without needing to do a complete dump and restore of the
> filesystem.
Of course people always like new features. :)
ext4 should allow you to deliver new features more rapidly, while
keeping the existing ext3 happily stable.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:09 ` Jeff Garzik
@ 2006-06-09 20:14 ` Alex Tomas
2006-06-09 20:28 ` Jeff Garzik
2006-06-19 7:48 ` [Ext2-devel] " Helge Hafting
0 siblings, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 20:14 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Theodore Tso, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> No, there is a key difference between ext3 and SCSI/etc.: cruft is removed.
JG> In ext3, old formats are supported for all eternity.
we'd need this anyway. just to let users migrate.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:14 ` Alex Tomas
@ 2006-06-09 20:28 ` Jeff Garzik
2006-06-19 7:48 ` [Ext2-devel] " Helge Hafting
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:28 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Theodore Tso, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> No, there is a key difference between ext3 and SCSI/etc.: cruft is removed.
>
> JG> In ext3, old formats are supported for all eternity.
>
> we'd need this anyway. just to let users migrate.
No, ext4 should remove some of the crufty old back-compat code.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:14 ` Alex Tomas
2006-06-09 20:28 ` Jeff Garzik
@ 2006-06-19 7:48 ` Helge Hafting
1 sibling, 0 replies; 295+ messages in thread
From: Helge Hafting @ 2006-06-19 7:48 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Theodore Tso, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>>>>>>
>
> JG> No, there is a key difference between ext3 and SCSI/etc.: cruft is removed.
>
> JG> In ext3, old formats are supported for all eternity.
>
> we'd need this anyway, just to let users migrate.
>
Not really. Today, people use reiserfs even though they couldn't
just remount their old ext2 as reiserfs.
An ext2/ext3-incompatible ext4 isn't a problem. Sure, people will
have to mkfs instead of just remounting, and that will mean fewer
quick conversions in the short-term. But people using ext3 today
don't really need ext4 - they are by definition running on sufficiently
small disks/partitions.
So an incompatible ext4 will still see use - on new filesystems mostly.
Not a problem, people buy disks all the time.
Helge Hafting
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:57 ` Theodore Tso
2006-06-09 20:09 ` Jeff Garzik
@ 2006-06-09 20:38 ` Joel Becker
2006-06-09 20:50 ` Dave Jones
2006-06-09 21:03 ` Theodore Tso
2006-06-12 8:58 ` Jes Sorensen
2 siblings, 2 replies; 295+ messages in thread
From: Joel Becker @ 2006-06-09 20:38 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 03:57:50PM -0400, Theodore Tso wrote:
> We don't do this with the SCSI layer where we make a complete clone of
> the driver layer so that there is a /usr/src/linux/driver/scsi and
> /usr/src/linux/driver/scsi2, do we? And we didn't do that with the
> networking layer either, as we added ipsec, ipv6, softnet, and a whole
> host of other changes and improvements.
Ted,
We don't have any permanent, physical representation of the
state either. With a filesystem we do. I don't care how many changes
you made to the SCSI stack. The code from a year ago could be entirely
different. However, if the old stack and the new stack both support
card X, then it Just Works. The Adaptec driver is a case in point.
When the new driver was still flaky, folks and distros could select the
old driver with impunity. Running the new driver didn't fundamentally
change your Adaptec card so you couldn't run the old one.
Filesystem features are different. There is a permanent state
that the older code cannot read. Alex claims people just shouldn't use
"-o extents", but the fact is their distro will choose it for them. We
have multiboot machines in our test lab, because like many people we
don't have unlimited funds. What happened when we installed the 2.6
distros? All of a sudden the older 2.4 distros wouldn't mount the
shared filesystems, because of ext3 features. This wasn't the kernel
driver, this was merely the tools! Surprise! We made no choice to use
new features, and they were thrust upon us. This will happen to others.
Joel
--
"Sometimes one pays most for the things one gets for nothing."
- Albert Einstein
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:38 ` Joel Becker
@ 2006-06-09 20:50 ` Dave Jones
2006-06-09 21:09 ` Joel Becker
2006-06-09 21:32 ` [Ext2-devel] " Jeff Garzik
2006-06-09 21:03 ` Theodore Tso
1 sibling, 2 replies; 295+ messages in thread
From: Dave Jones @ 2006-06-09 20:50 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> that the older code cannot read. Alex claims people just shouldn't use
> "-o extents", but the fact is their distro will choose it for them.
.. on partitions over a certain size, which couldn't be read with
older ext3 filesystems _anyway_
Enabling it by default on partitions of a size less than those
that need extents seems to be somewhat pointless to me?
Am I missing something fundamental that precludes the use of both
extent-based and current existing filesystems from the same code
simultaneously ?
Dave
--
http://www.codemonkey.org.uk
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:50 ` Dave Jones
@ 2006-06-09 21:09 ` Joel Becker
2006-06-09 21:51 ` Mike Snitzer
2006-06-09 21:32 ` [Ext2-devel] " Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Joel Becker @ 2006-06-09 21:09 UTC (permalink / raw)
To: Dave Jones, Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
Andreas Dilger
On Fri, Jun 09, 2006 at 04:50:36PM -0400, Dave Jones wrote:
> On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> > that the older code cannot read. Alex claims people just shouldn't use
> > "-o extents", but the fact is their distro will choose it for them.
>
> .. on partitions over a certain size, which couldn't be read with
> older ext3 filesystems _anyway_
Certainly that would be fine. Is that what will actually
happen? Experience says no. Even if you get it right in your distro,
not all distros will. Heck, can you promise me that your distro will
provide e2fsprogs updates to its older releases so that multiboot will
continue to work?
Joel
--
"Behind every successful man there's a lot of unsuccessful years."
- Bob Brown
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:09 ` Joel Becker
@ 2006-06-09 21:51 ` Mike Snitzer
0 siblings, 0 replies; 295+ messages in thread
From: Mike Snitzer @ 2006-06-09 21:51 UTC (permalink / raw)
To: Dave Jones, Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
Andreas Dilger
On 6/9/06, Joel Becker <Joel.Becker@oracle.com> wrote:
> On Fri, Jun 09, 2006 at 04:50:36PM -0400, Dave Jones wrote:
> > On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> > > that the older code cannot read. Alex claims people just shouldn't use
> > > "-o extents", but the fact is their distro will choose it for them.
> >
> > .. on partitions over a certain size, which couldn't be read with
> > older ext3 filesystems _anyway_
>
> Certainly that would be fine. Is that what will actually
> happen? Experience says no. Even if you get it right in your distro,
> not all distros will. Heck, can you promise me that your distro will
> provide e2fsprogs updates to its older releases so that multiboot will
> continue to work?
If the kernel were bound by all the stakeholders' ability to _always_
"do the right thing" very little innovation would be possible. These
tenuous arguments of hypothetical (ab)users are tiresome.
If the distro vendor did default to ext3+extents and it screwed your
hypothetical extents-naive user (booting a non-vendor kernel isn't
something your mom is going to do) then they strayed too far from
their Linux comfort-zone. If worst came to worst _THE UPDATED
EXT3UTILS WOULD PREVENT MOUNTING AN EXT3 FS WITH AN INCOMPATIBLE
FEATURE_. God forbid the naive-user get an error when they try
something they shouldn't.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:50 ` Dave Jones
2006-06-09 21:09 ` Joel Becker
@ 2006-06-09 21:32 ` Jeff Garzik
2006-06-09 22:56 ` Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:32 UTC (permalink / raw)
To: Dave Jones
Cc: Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
Dave Jones wrote:
> Am I missing something fundamental that precludes the use of both
> extent-based and current existing filesystems from the same code
> simultaneously ?
Nothing precludes it. The point is that introducing major format
changes inline in this manner just complicates the code progressively to
the point where your directory walking, inode block walking, and other
code winds up being
if (new)
...
else
...
_anyway_, at which point it is essentially multiple independent
filesystems. I guarantee this won't be the last fundamental fs metadata
design change people will want to make...
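
A rough C sketch of the branching described above, for illustration only.
The names (toy_inode, TOY_EXTENTS_FL, map_indirect, map_extents) and the toy
in-memory layouts are hypothetical and are not taken from the ext3 patches;
the sketch only assumes that each block-mapping caller has to test a
per-inode format flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode flag marking the extent format. */
#define TOY_EXTENTS_FL 0x1u

struct toy_extent {
    uint32_t logical;   /* first logical block covered */
    uint32_t len;       /* number of contiguous blocks */
    uint64_t physical;  /* first physical block */
};

struct toy_inode {
    uint32_t flags;
    /* old format: one physical block number per logical block */
    const uint32_t *block_map;
    size_t nr_blocks;
    /* new format: a short array of extents */
    const struct toy_extent *extents;
    size_t nr_extents;
};

/* Old-format lookup: index straight into the per-block table. */
static uint64_t map_indirect(const struct toy_inode *ino, uint32_t lblk)
{
    return lblk < ino->nr_blocks ? ino->block_map[lblk] : 0;
}

/* New-format lookup: find the extent that contains lblk. */
static uint64_t map_extents(const struct toy_inode *ino, uint32_t lblk)
{
    for (size_t i = 0; i < ino->nr_extents; i++) {
        const struct toy_extent *e = &ino->extents[i];
        if (lblk >= e->logical && lblk - e->logical < e->len)
            return e->physical + (lblk - e->logical);
    }
    return 0; /* hole or out of range */
}

/* The "if (new) ... else ..." branch that every caller ends up with. */
uint64_t toy_map_block(const struct toy_inode *ino, uint32_t lblk)
{
    assert(ino != NULL);
    if (ino->flags & TOY_EXTENTS_FL)
        return map_extents(ino, lblk);
    else
        return map_indirect(ino, lblk);
}
```

Both lookups return the same answers for the same file contents; only the
on-disk representation behind them differs, which is the duplication being
argued about.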
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 21:32 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 22:56 ` Andreas Dilger
2006-06-09 23:06 ` Linus Torvalds
2006-06-09 23:09 ` Jeff Garzik
0 siblings, 2 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 22:56 UTC (permalink / raw)
To: Jeff Garzik
Cc: Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel
On Jun 09, 2006 17:32 -0400, Jeff Garzik wrote:
> Dave Jones wrote:
> >Am I missing something fundamental that precludes the use of both
> >extent-based and current existing filesystems from the same code
> >simultaneously ?
>
> Nothing precludes it. The point is that introducing major format
> changes inline in this manner just complicates the code progressively to
> the point where your directory walking, inode block walking, and other
> code winds up being
>
> if (new)
> ...
> else
> ...
>
> _anyway_, at which point it is essentially multiple independent
> filesystems. I guarantee this won't be the last fundamental fs metadata
> design change people will want to make...
Umm, and how is this fundamentally different from similar code paths in
the VFS (e.g. O_DIRECT vs regular writes)? Should we make a copy of the
whole write path for each of O_DIRECT, AIO, pwrite, etc writes, or should
we instead add in a small change to the write path that leverages the
majority of the existing code?
What is better: using the 95% of the VFS that is common and changing 5% to
work with the filesystem, or duplicating the whole VFS just because 5%
needs to be different?
In the extents case, the large majority of the ext3 code is the same
(directory, inode handling, superblock, etc) and only the on-disk format
for indirect blocks has changed. Yes, we also want to change the block
allocator next in order to improve the performance in conjunction with
extents, but that is purely an in-memory change that has no direct
relation to on-disk layout. The major motivations for the extents format:
(a) more compact on-disk representation for large files (improves unlink
performance, reduces memory usage for metadata)
(b) support for larger filesystems (which will affect everyone soon enough).
(c) integrate well with improved allocation support
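
A back-of-the-envelope sketch of point (a), under stated assumptions: 4-byte
block pointers for the existing per-block scheme, and a 12-byte extent record.
The record size and the blocks_per_extent parameter are assumptions chosen
for illustration; the patches themselves define the real on-disk format.

```c
#include <assert.h>
#include <stdint.h>

#define PTR_BYTES    4u   /* one 32-bit block number per mapped block */
#define EXTENT_BYTES 12u  /* assumed size of one extent record */

/* Bytes of leaf mapping data for the per-block scheme. This ignores the
 * double/triple indirect index blocks, which add a little more. */
uint64_t indirect_map_bytes(uint64_t file_blocks)
{
    return file_blocks * PTR_BYTES;
}

/* Bytes of extent records when each extent covers, on average,
 * blocks_per_extent contiguous blocks. */
uint64_t extent_map_bytes(uint64_t file_blocks, uint64_t blocks_per_extent)
{
    assert(blocks_per_extent > 0);
    uint64_t nr_extents =
        (file_blocks + blocks_per_extent - 1) / blocks_per_extent;
    return nr_extents * EXTENT_BYTES;
}
```

For a 1 GiB file with 4 KiB blocks (262144 blocks), the per-block scheme needs
about 1 MiB of pointers, while eight large extents need under 100 bytes. The
same formula also shows the flip side: a fully fragmented file with one block
per extent would need more mapping data than the per-block scheme.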
For most of the ext3 developers (b) is the primary motivation here, and
given that so many people are vocal about ext3 changes that must mean
that there are a lot of ext3 users here. Does that mean that in the next
few years all of the objectors will abandon ext3 in favour of ext4 or XFS
or JFS or reiserfs or reiser4 when you get a new system with a single 12TB
disk? And we can delete ext3 then, or will you be happy then that ext3
supports these large disks without any effort on your part?
Maybe we should start by deleting ext2 because it is old and obsolete?
The reality is that we will never merge the forks back once they are made.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 22:56 ` Andreas Dilger
@ 2006-06-09 23:06 ` Linus Torvalds
2006-06-09 23:09 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 23:06 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Dave Jones, Alex Tomas
On Fri, 9 Jun 2006, Andreas Dilger wrote:
>
> Umm, and how is this fundamentally different from similar code paths in
> the VFS (e.g. O_DIRECT vs regular writes)?
That's a great argument.
You're arguing that your thing is good by pointing to a known disaster
area and saying "but that other thing does it too, so it must be good".
O_DIRECT is a piece of crap, and I'm still sorry that I didn't push back
enough on it. And I _did_ push back on it. But I finally was worn down.
And yes, part of the problem is exactly that it uses _almost_ the same
paths, but not quite.
Oh, well.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 22:56 ` Andreas Dilger
2006-06-09 23:06 ` Linus Torvalds
@ 2006-06-09 23:09 ` Jeff Garzik
2006-06-09 23:37 ` [Ext2-devel] " Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:09 UTC (permalink / raw)
To: Jeff Garzik, Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel
Andreas Dilger wrote:
> Maybe we should start by deleting ext2 because it is old and obsolete?
> The reality is that we will never merge the forks back once they are made.
We _already have_ a relevant example: ext2 -> ext3.
A useful fork is in the tree, and you're working on it.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 23:09 ` Jeff Garzik
@ 2006-06-09 23:37 ` Andreas Dilger
2006-06-09 23:54 ` Linus Torvalds
0 siblings, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:37 UTC (permalink / raw)
To: Jeff Garzik
Cc: Dave Jones, Theodore Tso, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel
On Jun 09, 2006 19:09 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Maybe we should start by deleting ext2 because it is old and obsolete?
> >The reality is that we will never merge the forks back once they are made.
>
> We _already have_ a relevant example: ext2 -> ext3.
>
> A useful fork is in the tree, and you're working on it.
OK, you're right. We'll continue working on the fork (namely ext3) and
when people who care consider those features stable enough they can port
them to ext2. :-)
Like another person pointed out - there are bugs that are fixed in ext3
that aren't fixed in ext2, and vice versa. Even though the ext2 code
is basically dead, new bugs are still found in it.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 23:37 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 23:54 ` Linus Torvalds
0 siblings, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 23:54 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Dave Jones, Alex Tomas
On Fri, 9 Jun 2006, Andreas Dilger wrote:
>
> OK, you're right. We'll continue working on the fork (namely ext3) and
> when people who care consider those features stable enough they can port
> them to ext2. :-)
You're totally inappropriately focused on this whole "porting back" side.
THE WHOLE POINT IS TO NOT PORT THINGS BACK.
There is absolutely no point in any ext4 work being ported back to ext3,
since the whole point of a fork like this is to have the "stable" thing.
Yes, old bugs happen and sometimes exist in both, but hey, the number of
duplicated bugs - while non-zero - is still smaller than the number of bugs
introduced by trying to keep things in sync.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:38 ` Joel Becker
2006-06-09 20:50 ` Dave Jones
@ 2006-06-09 21:03 ` Theodore Tso
2006-06-09 21:24 ` Joel Becker
2006-06-09 23:48 ` Jeff Garzik
1 sibling, 2 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 21:03 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 01:38:03PM -0700, Joel Becker wrote:
> Filesystem features are different. There is a permanent state
> that the older code cannot read. Alex claims people just shouldn't use
> "-o extents", but the fact is their distro will choose it for them. We
> have multiboot machines in our test lab, because like many people we
> don't have unlimited funds. What happened when we installed the 2.6
> distros? All of a sudden the older 2.4 distros wouldn't mount the
> shared filesystems, because of ext3 features.
This is going to happen regardless of whether we call the code base
"ext3" or "ext4". Anytime you make format changes (in this case, to
support larger disk sizes) older kernels won't support it any more.
Surprise!
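
The size limits quoted in the cover letter for this thread (8TB today with 4k
blocks, 1024 PB with 48-bit block numbers) can be checked with a short
calculation. This sketch assumes the limit is simply block size times
2^bits, and that the 8TB figure corresponds to 31 usable bits of a 32-bit
block number; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest addressable filesystem size in bytes: 2^bits blocks of
 * block_size bytes each. The result must fit in 64 bits. */
uint64_t max_fs_bytes(unsigned bits, uint64_t block_size)
{
    assert(bits < 64);
    assert(block_size != 0 && block_size <= (UINT64_MAX >> bits));
    return block_size << bits;
}
```

With 4096-byte blocks, 31 bits gives 2^43 bytes (8 TiB) and 48 bits gives
2^60 bytes (1024 PiB), matching the figures in the cover letter.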
In the case you were referring to, one specific distribution, Red Hat,
silently added the extended attributes feature to the filesystem
because it was needed by SELINUX. This was actually a backwards
compatible feature, so that older 2.4 based distributions could
*mount* the filesystem. Unfortunately e2fsck needs to be more
careful, and so the problem was that the older distribution's fsck
wasn't able to check the filesystem. But this was actually Red Hat's
fault, in that they shouldn't have transparently added the extended
attribute feature without first asking the user's permission.
Being able to forward upgrade to newer filesystem formats is a good
thing, and has a long history; users don't like to do a backup,
reformat, and restore pass if they can help it. Heck, Microsoft
Windows even has a way that they can upgrade a FAT filesystem to their
latest NTFSv5 filesystem using a userspace program. Providing such a
capability is not a bad thing, and in fact it is a good thing. The
bad thing to do is to do the conversion without first asking the
user's permission (for example just as Windows XP does when you first
boot a preinstalled system on a laptop for the first time).
People seem to be arguing that just because a distribution installer
_could_ do a backwards incompatible upgrade without first asking
permission, we must not provide the capability at all, and make
it be the case that the only way to upgrade from ext3 to ext4 is with
a backup, reformat, and restore. Surely that doesn't make sense!
> This wasn't the kernel driver, this was merely the tools! Surprise!
> We made no choice to use new features, and they were thrust upon us.
> This will happen to others.
I suspect that Red Hat has learned from that past experience, and
won't be making that mistake again, at least without explicitly
requesting the user's permission. So how about we trust the
distributions to be a bit more careful this time around?
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:03 ` Theodore Tso
@ 2006-06-09 21:24 ` Joel Becker
2006-06-09 21:36 ` [Ext2-devel] " Chase Venters
2006-06-09 21:51 ` Theodore Tso
2006-06-09 23:48 ` Jeff Garzik
1 sibling, 2 replies; 295+ messages in thread
From: Joel Becker @ 2006-06-09 21:24 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 05:03:19PM -0400, Theodore Tso wrote:
> This is going to happen regardless of whether we call the code base
> "ext3" or "ext4". Anytime you make format changes (in this case, to
> support larger disk sizes) older kernels won't support it any more.
> Surprise!
Of course format changes break things. But if you claim that
"X" and "Y" are the same thing, and they aren't, people won't see it
coming.
> wasn't able to check the filesystem. But this was actually Red Hat's
> fault, in that they shouldn't have transparently added the extended
> attribute feature without first asking the user's permission.
Sure it was Red Hat's fault. Knowing who to blame doesn't solve
the existing problem, though. They never even put out e2fsck upgrades
for older distros, which would have solved the problem just as easily.
> Being able to forward upgrade to newer filesystem formats is a good
> thing, and has a long history; users don't like to do a backup,
> reformat, and restore pass if they can help it. Heck, Microsoft
> Windows even has a way that they can upgrade a FAT filesystem to their
> latest NTFSv5 filesystem using a userspace program. Providing such a
> capability is not a bad thing, and in fact it is a good thing. The
> bad thing to do is to do the conversion without first asking the
> user's permission (for example just as Windows XP does when you first
> boot a preinstalled system on a laptop for the first time).
This entire statement is true. However, note that "FAT" becomes
"NTFSv5", and there is no expectation, implicit or explicit, that you
can use "FAT" to mount the changed volume.
You can call the new filesystem ext4, and mount an old ext3 as
ext4, and guess what? You're just as forward compatible, but now you've
explicitly specified the lack of backwards compatibility. You could even
provide a userspace tool just like in your example to switch an INCOMPAT
feature.
> People seem to be arguing that just because a distribution installer
> _could_ do a backwards incompatible upgrade without first asking
> permission, we must not provide the capability at all, and make
> it be the case that the only way to upgrade from ext3 to ext4 is with
> a backup, reformat, and restore. Surely that doesn't make sense!
There is no reason you need a backup/restore cycle. Mount it as
ext4, and forever forward it's an ext4. In the ext2->ext3 cycle, we
called it "tune2fs -j".
> So how about we trust the distributions to be a bit more careful
> this time around?
Haha, you're funny.
Seriously, Ted, I personally have one concern here. I don't
care much about the maintainability of one code base versus two. Both
have advantages and problems. I care a little that my "used to be
stable" ext3 code base might be destabilized, but I know that the ext2/3
team has been better than most at stable code transitions.
What I do care about is that "ext3" can no longer mount partition X.
That's gonna happen. This thing still has the same name, but it is in
actuality something very different. When ext2 could no longer mount a
journaled version of itself, we changed it to "ext3".
Heck, forget the name, just make the breakage more explicit. Do
it at mkfs/tunefs time. "tunefs -extents" or "mkfs -t ext3 -extents".
A mount option assumes that you can do with or without it. If you do it
once, you can mount the next time without it and stuff Just Works. Even
htree follows this. A clean unmount leaves a clean directory structure
that a non-htree driver can use.
Joel
--
"Not being known doesn't stop the truth from being true."
- Richard Bach
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 21:24 ` Joel Becker
@ 2006-06-09 21:36 ` Chase Venters
2006-06-09 21:51 ` Theodore Tso
1 sibling, 0 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 21:36 UTC (permalink / raw)
To: Joel Becker
Cc: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Joel Becker wrote:
> Heck, forget the name, just make the breakage more explicit. Do
> it at mkfs/tunefs time. "tunefs -extents" or "mkfs -t ext3 -extents".
> A mount option assumes that you can do with or without it. If you do it
> once, you can mount the next time without it and stuff Just Works. Even
> htree follows this. A clean unmount leaves a clean directory structure
> that a non-htree driver can use.
I suggested this somewhere back in the thread and it got no play. What's
the problem with doing things this way? (Aside from it being a compromise
that doesn't automatically result in a new ext4)
Of course, there are a few debates going on here. Only one of them is
about compatibility.
>
> Joel
>
Cheers,
Chase
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:24 ` Joel Becker
2006-06-09 21:36 ` [Ext2-devel] " Chase Venters
@ 2006-06-09 21:51 ` Theodore Tso
2006-06-09 22:07 ` Joel Becker
1 sibling, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 21:51 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 02:24:10PM -0700, Joel Becker wrote:
> Heck, forget the name, just make the breakage more explicit. Do
> it at mkfs/tunefs time. "tunefs -extents" or "mkfs -t ext3 -extents".
> A mount option assumes that you can do with or without it. If you do it
> once, you can mount the next time without it and stuff Just Works. Even
> htree follows this. A clean unmount leaves a clean directory structure
> that a non-htree driver can use.
Agreed; I was never a fan of how we enabled extended attributes
using a mount option, as it clutters the /etc/fstab line, among other
things. (I added the tune2fs -o feature so that default mount options
could be stored in the superblock to try to cover that design botch,
but the real answer is that extended attributes should never have been
done via a mount option, or at least not as the only thing
you had to do before the feature became enabled.)
The right approach is what we did with journaling (tune2fs -j or
tune2fs -O has_journal) and what we did with htree support (tune2fs -O
dir_index), to explicitly enable that feature, and that's certainly
what I will be pushing for.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:51 ` Theodore Tso
@ 2006-06-09 22:07 ` Joel Becker
2006-06-09 22:31 ` [Ext2-devel] " Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Joel Becker @ 2006-06-09 22:07 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 05:51:37PM -0400, Theodore Tso wrote:
> The right approach is what we did with journaling (tune2fs -j or
> tune2fs -O has_journal) and what we did with htree support (tune2fs -O
> dir_index), to explicitly enable that feature, and that's certainly
> what I will be pushing for.
Excellent. And now let's close the other side of compatibility.
The attribute problem we discussed with e2fsck has a simple solution:
exit cleanly when you don't understand a filesystem.
If e2fsck finds an INCOMPAT flag it doesn't understand, it
didn't *fail* to fsck, it just plain doesn't understand the filesystem.
This should not, in any way, prevent bootup from continuing. Later,
mount may succeed (if the kernel is new enough) or fail (if not), but my
system won't be completely unusable by surprise (assuming that / isn't
the affected filesystem).
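
A minimal sketch of the policy proposed here, assuming a checker that was
built with a fixed mask of incompatible-feature bits it understands. The
names (toy_sb, FSCK_KNOWN_INCOMPAT, CHECK_*) are hypothetical and the bit
values are made up; this is not e2fsck's actual code or its exit-status
convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_sb {
    uint32_t feature_incompat;  /* bits a reader must understand */
};

/* Incompat bits this (old) checker was built to handle; made-up values. */
#define TOY_INCOMPAT_FILETYPE 0x02u
#define TOY_INCOMPAT_RECOVER  0x04u
#define TOY_INCOMPAT_EXTENTS  0x40u
#define FSCK_KNOWN_INCOMPAT   (TOY_INCOMPAT_FILETYPE | TOY_INCOMPAT_RECOVER)

enum check_result {
    CHECK_PROCEED,      /* format understood: run the full check */
    CHECK_UNSUPPORTED   /* unknown format: "not checked", which is distinct
                           from both "clean" and "check failed" */
};

/* Decide whether this checker understands the filesystem at all.
 * The unrecognized bits are returned so the caller can report them. */
enum check_result classify_fs(const struct toy_sb *sb, uint32_t *unknown_bits)
{
    assert(sb != NULL && unknown_bits != NULL);
    *unknown_bits = sb->feature_incompat & ~FSCK_KNOWN_INCOMPAT;
    return *unknown_bits ? CHECK_UNSUPPORTED : CHECK_PROCEED;
}
```

Keeping "unsupported" as its own result leaves the boot scripts free to
choose between continuing and halting, and lets them tell the administrator
that the filesystem was skipped rather than checked.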
Joel
--
Bram's Law:
The easier a piece of software is to write, the worse it's
implemented in practice.
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 22:07 ` Joel Becker
@ 2006-06-09 22:31 ` Theodore Tso
2006-06-09 22:47 ` Joel Becker
0 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 22:31 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 03:07:11PM -0700, Joel Becker wrote:
> Excellent. And now let's close the other side of compatibility.
> The attribute problem we discussed with e2fsck has a simple solution:
> exit cleanly when you don't understand a filesystem.
> If e2fsck finds an INCOMPAT flag it doesn't understand, it
> didn't *fail* to fsck, it just plain doesn't understand the filesystem.
> This should not, in any way, prevent bootup from continuing. Later,
> mount may succeed (if the kernel is new enough) or fail (if not), but my
> system won't be completely unusable by surprise (assuming that / isn't
> the affected filesystem).
The potential problem with this is that the system administrator may never
realize that the filesystem is just getting silently skipped. (And a
big fat warning printed by e2fsck doesn't help when distros like
Ubuntu use a graphical boot sequence that hides warning messages
printed by e2fsck).
Is it really that hard to edit /etc/fstab so that the fsck pass is
skipped?
I might be willing to make it be a configurable option in
/etc/e2fsck.conf, but it *is* dangerous to have e2fsck exit with
success without having actually checked the filesystem.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 22:31 ` [Ext2-devel] " Theodore Tso
@ 2006-06-09 22:47 ` Joel Becker
2006-06-09 23:54 ` [Ext2-devel] " Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Joel Becker @ 2006-06-09 22:47 UTC (permalink / raw)
To: Theodore Tso, Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 06:31:29PM -0400, Theodore Tso wrote:
> The potential problem with this is that the system administrator may never
> realize that the filesystem is just getting silently skipped. (And a
> big fat warning printed by e2fsck doesn't help when distros like
> Ubuntu use a graphical boot sequence that hides warning messages
> printed by e2fsck).
Yeah, you're not the only one to point this out.
> Is it really that hard to edit /etc/fstab so that the fsck pass is
> skipped?
Depends on how close you are in proximity to the console, I
suspect. Point is, something _broke_.
Regardless of the name, clearly we have a _different_
filesystem. With a clean unmount, a journaled ext3 is still a valid
ext2. With a clean unmount, an extended-attribute ext3 is still a valid
plain ext3 and a valid ext2. With a clean unmount, a dir_index ext3 is
still a valid plain ext3 and a valid ext2. An extents ext3 is NEVER a
valid plain ext3 or ext2.
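
The distinction drawn above can be sketched with feature-class bitmasks. The
sketch assumes the usual ext2-family rule that unknown "compat" bits may be
ignored while any unknown "incompat" bit forbids mounting, and that a
journaled filesystem carries a needs-recovery incompat bit only while it is
unclean. All names and bit values here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, split by class. */
#define TOY_COMPAT_HAS_JOURNAL 0x01u
#define TOY_COMPAT_EXT_ATTR    0x02u
#define TOY_COMPAT_DIR_INDEX   0x04u
#define TOY_INCOMPAT_RECOVER   0x01u  /* journal must be replayed first */
#define TOY_INCOMPAT_EXTENTS   0x02u  /* block maps use the extent format */

struct toy_sb {
    uint32_t compat;    /* unknown bits here are safe to ignore */
    uint32_t incompat;  /* unknown bits here forbid mounting */
};

/* A driver can mount when it understands every incompat bit that is set. */
bool toy_can_mount(const struct toy_sb *sb, uint32_t driver_incompat)
{
    assert(sb != NULL);
    return (sb->incompat & ~driver_incompat) == 0;
}

/* A clean unmount leaves nothing to replay, so the recovery bit is dropped.
 * Other incompat bits describe the on-disk format and stay set. */
void toy_clean_unmount(struct toy_sb *sb)
{
    assert(sb != NULL);
    sb->incompat &= ~TOY_INCOMPAT_RECOVER;
}
```

In this model a driver that knows no incompat bits can mount a journaled,
dir_index, extended-attribute filesystem once it has been cleanly unmounted,
but can never mount one with the extents bit set, no matter how it was
unmounted.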
What happens today if you have a filesystem in fstab that
has no fsck in /sbin (eg, we all pick the name 'ext4', it says 'ext4' in
fstab, but there is no /sbin/fsck.ext4)? Does "fsck -a" skip the
partition, or halt and fail the boot? If the latter, I suspect that the
only solution is "I hope you don't encounter this on remote machines ha
ha ha". If it skips, we have yet another reason that using the same
name is a bad thing.
Joel
--
"Sometimes when reading Goethe I have the paralyzing suspicion
that he is trying to be funny."
- Guy Davenport
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 22:47 ` Joel Becker
@ 2006-06-09 23:54 ` Theodore Tso
0 siblings, 0 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 23:54 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 03:47:00PM -0700, Joel Becker wrote:
> What happens today if you have a filesystem in fstab that
> has no fsck in /sbin (eg, we all pick the name 'ext4', it says 'ext4' in
> fstab, but there is no /sbin/fsck.ext4)? Does "fsck -a" skip the
> partition, or halt and fail the boot? If the latter, I suspect that the
> only solution is "I hope you don't encounter this on remote machines ha
> ha ha".
It will halt and fail the boot.
Of course, installing a kernel more recent than 2.6.14 or so on a RHEL4
system when you have a SCSI controller such as MPT Fusion will also
cause the system to fail to boot unless you remember to compile it
directly into the kernel because of changes in semantics about whether
the SCSI probing happens before or after the module load completes ---
and the answer that has been given is "we don't care". So these sorts
of traps have been around for people who are going back and forth
between the bleeding edge and distro systems, but I think we'd all
agree that this isn't necessarily the common case.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:03 ` Theodore Tso
2006-06-09 21:24 ` Joel Becker
@ 2006-06-09 23:48 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:48 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Theodore Tso wrote:
> I suspect that Red Hat has learned from that past experience, and
> won't be making that mistake again, at least without explicitly
> requesting the user's permission. So how about we trust the
> distributions to be a bit more careful this time around?
Make the line of demarcation much more clear...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:57 ` Theodore Tso
2006-06-09 20:09 ` Jeff Garzik
2006-06-09 20:38 ` Joel Becker
@ 2006-06-12 8:58 ` Jes Sorensen
2 siblings, 0 replies; 295+ messages in thread
From: Jes Sorensen @ 2006-06-12 8:58 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> "Ted" == Theodore Tso <tytso@mit.edu> writes:
Ted> On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
>> 1) clone a new tree 2) cp -a fs/ext3 fs/ext4 3) apply extent and
>> 48bit patches 4) apply related e2fsprogs patches
>>
>> Then update ext4 step-by-step, using the normal Linux development
>> process.
Ted> We don't do this with the SCSI layer where we make a complete
Ted> clone of the driver layer so that there is a
Ted> /usr/src/linux/driver/scsi and /usr/src/linux/driver/scsi2, do
Ted> we? And we didn't do that with the networking layer either, as
Ted> we added ipsec, ipv6, softnet, and a whole host of other changes
Ted> and improvements.
Maybe it's just me, but I am reading oranges vs apples there. The SCSI
comparison is like suggesting we go from the VFS to a VFS2 or fs/ ->
fs2/ for this. On the other hand going from ext3 -> ext4 to get
something incompatible (like enabling extents or 48 bit) is similar to
going net/ipv4 -> net/ipv6, which we did do indeed.
Fact of the matter is that 2.4 is dead or at least frozen solid by
now. The userland of most distros today wouldn't be able to boot with
a 2.4 kernel anyway.
Granted I am not a filesystem expert, but personally I would feel more
comfortable deciding to put my data on an ext4 file system knowing
that it was just that, rather than a hybrid.
Jes
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:55 ` Jeff Garzik
2006-06-09 17:12 ` [Ext2-devel] " Alex Tomas
2006-06-09 19:57 ` Theodore Tso
@ 2006-06-10 0:07 ` Olivier Galibert
2006-06-10 0:13 ` Jeff Garzik
2 siblings, 1 reply; 295+ messages in thread
From: Olivier Galibert @ 2006-06-10 0:07 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
> Alex Tomas wrote:
> >so, instead of taking one (quite-well-tested) part that solves one of
> >the biggest ext3 limitation, you propose to start a new project and
> >get something in a year (probably) ?
> >
> >I think about extents as a step-by-step way ...
>
> That is what the entirety of Linux development is -- step-by-step.
>
> It is OBVIOUS that it would take five minutes to start ext4.
>
> 1) clone a new tree
> 2) cp -a fs/ext3 fs/ext4
> 3) apply extent and 48bit patches
> 4) apply related e2fsprogs patches
5) force all options (attributes, etc) on and remove the flags
indicating their existence from the metadata, you'll need the space
for the fs evolution.
6) change the fs just enough so that an ext4 fs can never be mounted
as ext3 or 2.
OG.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 0:07 ` Olivier Galibert
@ 2006-06-10 0:13 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 0:13 UTC (permalink / raw)
To: Olivier Galibert, Jeff Garzik, Alex Tomas, Linus Torvalds,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel,
Andreas Dilger
Olivier Galibert wrote:
> On Fri, Jun 09, 2006 at 12:55:09PM -0400, Jeff Garzik wrote:
>> Alex Tomas wrote:
>>> so, instead of taking one (quite-well-tested) part that solves one of
>>> the biggest ext3 limitation, you propose to start a new project and
>>> get something in a year (probably) ?
>>>
>>> I think about extents as a step-by-step way ...
>> That is what the entirety of Linux development is -- step-by-step.
>>
>> It is OBVIOUS that it would take five minutes to start ext4.
>>
>> 1) clone a new tree
>> 2) cp -a fs/ext3 fs/ext4
>> 3) apply extent and 48bit patches
>> 4) apply related e2fsprogs patches
>
> 5) force all options (attributes, etc) on and remove the flags
> indicating their existence from the metadata, you'll need the space
> for the fs evolution.
>
> 6) change the fs just enough so that an ext4 fs can never be mounted
> as ext3 or 2.
Yeah. And it's easy enough just to change the main magic number...
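A sketch of what a magic-number check looks like from userspace, assuming the standard layout (primary superblock 1024 bytes into the device, little-endian 16-bit s_magic at offset 56 within it, value 0xEF53 shared by ext2 and ext3). The replacement value `HYPOTHETICAL_EXT4_MAGIC` is made up for illustration; nobody in this thread proposed a number.

```python
import struct

SUPERBLOCK_OFFSET = 1024          # primary superblock starts here
S_MAGIC_OFFSET = 56               # offset of s_magic inside the superblock
EXT2_SUPER_MAGIC = 0xEF53         # shared today by ext2 and ext3
HYPOTHETICAL_EXT4_MAGIC = 0xEF54  # invented for this sketch only


def read_magic(image):
    """Extract s_magic from the first bytes of a device or image."""
    (magic,) = struct.unpack_from(
        "<H", image, SUPERBLOCK_OFFSET + S_MAGIC_OFFSET)
    return magic


def ext3_would_accept(image):
    """True if the magic matches what ext2/ext3 look for."""
    return read_magic(image) == EXT2_SUPER_MAGIC


def make_image(magic):
    """Build a zeroed 2 KiB image carrying only the given magic."""
    image = bytearray(SUPERBLOCK_OFFSET + 1024)
    struct.pack_into(
        "<H", image, SUPERBLOCK_OFFSET + S_MAGIC_OFFSET, magic)
    return bytes(image)
```

With a different magic the old drivers fail at the very first check, before any feature flag is looked at.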
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:25 ` Linus Torvalds
2006-06-09 16:48 ` Alex Tomas
@ 2006-06-09 16:54 ` Linus Torvalds
2006-06-09 17:04 ` Alex Tomas
` (2 more replies)
2006-06-09 17:12 ` Jeff Anderson-Lee
2006-06-09 18:02 ` Andrew Morton
3 siblings, 3 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:54 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> take up way too much space in memory.
Btw, I'm not kidding you on this one.
THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
And you know what? 2TB files are totally uninteresting to 99.9999% of all
people. Most people find it _much_ more interesting to have hundreds of
thousands of _smaller_ files instead.
So do this:
cat /proc/slabinfo | grep ext3
and be absolutely disgusted and horrified by the size of those inodes
already, and ask yourself whether extending the block size to 48 bits will
help or further hurt one of the biggest problems of ext3 right now?
(And yes, I realize that block numbers are just a small part of it. The
"vfs_inode" is also a real problem - it's got _way_ too many large
list-heads that explode on a 64-bit kernel, for example. Oh, well. My
point is that things like this can make a very real issue _worse_ for all
the people who don't care one whit about it)
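To turn the slabinfo grep into a byte count, something like the following sketch can help. It assumes the first four columns are name, active objects, total objects and object size, as in the version 2.1 format; `slab_usage` is a hypothetical helper, and the sample line is the ext3_inode_cache entry posted elsewhere in this thread.

```python
def slab_usage(slabinfo_text, needle):
    """Return {cache_name: (active_bytes, allocated_bytes)} for matches."""
    usage = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or needle not in fields[0]:
            continue
        try:
            active, total, objsize = (int(f) for f in fields[1:4])
        except ValueError:
            continue  # header or malformed line
        usage[fields[0]] = (active * objsize, total * objsize)
    return usage


if __name__ == "__main__":
    # Sample line taken from a reply in this thread, not from a live system.
    sample = "ext3_inode_cache 30207 41418 616 6"
    for name, (active, allocated) in slab_usage(sample, "ext3").items():
        print(f"{name}: {active / 2**20:.1f} MiB active, "
              f"{allocated / 2**20:.1f} MiB allocated")
```

On a live system the text would come from /proc/slabinfo, which may require root to read.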
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:54 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:04 ` Alex Tomas
2006-06-09 17:30 ` [Ext2-devel] " Linus Torvalds
2006-06-09 17:44 ` Theodore Tso
2006-06-09 18:10 ` [Ext2-devel] " Andreas Dilger
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 17:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
oops :) I don't follow that well ...
size of in-core inodes is a different problem.
thanks, Alex
>>>>> Linus Torvalds (LT) writes:
LT> On Fri, 9 Jun 2006, Linus Torvalds wrote:
>>
>> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
>> take up way too much space in memory.
LT> Btw, I'm not kidding you on this one.
LT> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
LT> And you know what? 2TB files are totally uninteresting to 99.9999% of all
LT> people. Most people find it _much_ more interesting to have hundreds of
LT> thousands of _smaller_ files instead.
LT> So do this:
LT> cat /proc/slabinfo | grep ext3
LT> and be absolutely disgusted and horrified by the size of those inodes
LT> already, and ask yourself whether extending the block size to 48 bits will
LT> help or further hurt one of the biggest problems of ext3 right now?
LT> (And yes, I realize that block numbers are just a small part of it. The
LT> "vfs_inode" is also a real problem - it's got _way_ too many large
LT> list-heads that explode on a 64-bit kernel, for example. Oh, well. My
LT> point is that things like this can make a very real issue _worse_ for all
LT> the people who don't care one whit about it)
LT> Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:04 ` Alex Tomas
@ 2006-06-09 17:30 ` Linus Torvalds
2006-06-09 17:41 ` Matthew Wilcox
0 siblings, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 17:30 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Alex Tomas wrote:
>
> oops :) I don't follow that well ...
>
> size of in-core inodes is a different problem.
Not really. It's really the same problem: adding features has a real cost.
And the cost is higher if you don't add them in a way that is statically
separable.
So I'm not trying to make the in-core inode size be "the thing" to
concentrate on. And I'm not saying that extents is inherently "the thing"
that makes it sane to split up development. That time might have been a
few years ago, or it might be in the future.
So don't get me wrong. I'm (a) generally supporting Jeff in that I think
it makes sense to split projects off occasionally, and maybe even plan on
hopefully having the original project deleted in the long run (it does
actually happen, although it is fairly rare). And (b) trying to show the
costs.
For me, the biggest cost tends to actually be support. A stable filesystem
that is used by thousands and thousands of people and that isn't actually
developed outside of just maintaining it IS A REALLY GOOD THING TO HAVE.
And I'm not saying that just because it's a filesystem, and people get
upset if they lose data. No, I'm saying it because from a maintenance
standpoint, such a filesystem has almost zero cost.
So from a maintenance standpoint, it's actually a _lot_ more useful to me
(and probably to a lot of other people) if development is done as its own
project, and is merged as its own sub-project. When problems happen, it's
fairly obvious what they are, and it's very much a case of all the people
involved having made that choice ("Hey, you knew it wasn't as stable, but
you wanted it for your special needs").
As an additional bonus, it tends to help find patterns in bug-reports
("ahh, everyone involved is running ext4"). So not only does it not affect
people who don't want to be affected, it also helps _pinpoint_ where
problems are when they do happen.
Also, if it turns out that the stabilization thing worked well, and after
a few years the _new_ code hasn't gotten any changes, and there are no
other real downsides either, they can actually be merged later on too.
That's what we're seeing in the 64-bit architecture support on both s390
and powerpc (and maybe even x86, eventually? Possibly not, but who
knows..). But that's a separate issue.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:30 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:41 ` Matthew Wilcox
2006-06-09 17:50 ` Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Matthew Wilcox @ 2006-06-09 17:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
> And I'm not saying that just because it's a filesystem, and people get
> upset if they lose data. No, I'm saying it because from a maintenance
> standpoint, such a filesystem has almost zero cost.
One of the costs (and I'm not disagreeing with your main point;
I think forking ext3 to ext4 at this point is reasonable), is that
bugfixes applied to one don't necessarily get applied to the other.
I found some recently between ext2 and ext3, and submitted those, but I
only audited one file. There's lots more to look at and I just haven't
found the time recently. Going to three variations is a lot more work
for auditing, and it might be worth splitting some bits which genuinely
are the same into common code.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:41 ` Matthew Wilcox
@ 2006-06-09 17:50 ` Jeff Garzik
2006-06-09 18:00 ` Alex Tomas
2006-06-09 18:04 ` [Ext2-devel] " Linus Torvalds
2006-06-09 18:17 ` Michael Poole
2 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:50 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Matthew Wilcox wrote:
> On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
>> And I'm not saying that just because it's a filesystem, and people get
>> upset if they lose data. No, I'm saying it because from a maintenance
>> standpoint, such a filesystem has almost zero cost.
>
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.
> I found some recently between ext2 and ext3, and submitted those, but I
> only audited one file. There's lots more to look at and I just haven't
> found the time recently. Going to three variations is a lot more work
> for auditing, and it might be worth splitting some bits which genuinely
> are the same into common code.
With extents and 48bit, you have multiple code paths to audit, regardless.
If applied to ext3, you have to audit
fs/ext3/*.c:
if (extents)
...
else
...
as opposed to
fs/ext3/*.c:
... non-extent code
fs/ext4/*.c:
... extent code
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:50 ` Jeff Garzik
@ 2006-06-09 18:00 ` Alex Tomas
0 siblings, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 18:00 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Matthew Wilcox, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
IMHO, 3 (three) if's for a whole fs don't look that bad.
on the other side, you'd need to audit much more of
almost the same lines ...
thanks, Alex
>>>>> Jeff Garzik (JG) writes:
JG> With extents and 48bit, you have multiple code paths to audit, regardless.
JG> If applied to ext3, you have to audit
JG> fs/ext3/*.c:
JG> if (extents)
JG> ...
JG> else
JG> ...
JG> as opposed to
JG> fs/ext3/*.c:
JG> ... non-extent code
JG> fs/ext4/*.c:
JG> ... extent code
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:41 ` Matthew Wilcox
2006-06-09 17:50 ` Jeff Garzik
@ 2006-06-09 18:04 ` Linus Torvalds
2006-06-09 18:17 ` Michael Poole
2 siblings, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:04 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
cmm, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Matthew Wilcox wrote:
>
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.
I agree. However, that tends to be less of an issue if you fork off a
stable base (which isn't always the case). Forking off something that is
still being actively developed is a different matter entirely. I don't
think ext3 is in that situation, really.
Also, one of the issues is when there are big VFS layer changes, which
affect all filesystems. Then, a lot of people will think that it's easier
to fix up one unified filesystem than it is to fix up five separate ones,
and the fact is, that's often _not_ the case.
The unified filesystem potentially has so much crud and crap and other
issues that it ends up being much more work to understand and fix it up
than it would have been to do the same thing for five different
filesystems that didn't play a lot of games and have complex
"if this flag is set, do this code, otherwise do that code, and this
whole directory reading code btw has a static CONFIG_EXT3_INDEX thing,
so you won't even know if you caught all the interface changes when you
get a clean compile"
So I'm not a huge believer in "shared code is good code". I believe shared
code is good only if it has no conditionals.
Ie the VFS-layer kind of code that acts the SAME for everybody is the good
kind of sharing. The kind where you call into different routines that will
do different things depending on a flag (which may not even be obvious to
the caller) is usually the _bad_ kind of sharing, because that's the kind
of code that ends up working for one user and not working for another, and
trying to make it work for both may be fundamentally hard.
The
if (sb->option.extent)
.. do one thing ..
else
.. do another ..
kind of thing is exactly what leads to problems later. Even if it allows
sharing of 90% of the code (the caller of the function), it leads to
problems exactly because of things that end up not quite working because
people only tested one code-path, and it broke the other case in some
really subtle way.
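The two shapes can be put side by side in a toy model. This is a sketch only: the dictionaries, the `(first_logical, length, first_physical)` extent tuples and all function names are invented for illustration and bear no relation to the real ext3 data structures.

```python
def lookup_indirect(block_map, logical):
    """Block-mapped file: one physical block stored per logical block."""
    return block_map.get(logical)


def lookup_extent(extents, logical):
    """Extent-mapped file: runs of (first_logical, length, first_physical)."""
    for first_logical, length, first_physical in extents:
        if first_logical <= logical < first_logical + length:
            return first_physical + (logical - first_logical)
    return None  # hole


# Shape 1: one shared routine that branches on a per-inode flag.
# A test that only ever sets uses_extents=False never runs the other arm.
def map_block_shared(inode, logical):
    if inode["uses_extents"]:
        return lookup_extent(inode["map"], logical)
    return lookup_indirect(inode["map"], logical)


# Shape 2: the caller is identical for everybody and each filesystem
# supplies its own routine, in the style of a VFS operations table.
OPS = {"ext3": lookup_indirect, "ext4": lookup_extent}


def map_block_split(fstype, mapping, logical):
    return OPS[fstype](mapping, logical)
```

Both shapes return the same answers here; the difference being argued about is which code a change to one path can silently break.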
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:41 ` Matthew Wilcox
2006-06-09 17:50 ` Jeff Garzik
2006-06-09 18:04 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 18:17 ` Michael Poole
2 siblings, 0 replies; 295+ messages in thread
From: Michael Poole @ 2006-06-09 18:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger
Matthew Wilcox writes:
> On Fri, Jun 09, 2006 at 10:30:06AM -0700, Linus Torvalds wrote:
> > And I'm not saying that just because it's a filesystem, and people get
> > upset if they lose data. No, I'm saying it because from a maintenance
> > standpoint, such a filesystem has almost zero cost.
>
> One of the costs (and I'm not disagreeing with your main point;
> I think forking ext3 to ext4 at this point is reasonable), is that
> bugfixes applied to one don't necessarily get applied to the other.
> I found some recently between ext2 and ext3, and submitted those, but I
> only audited one file. There's lots more to look at and I just haven't
> found the time recently. Going to three variations is a lot more work
> for auditing, and it might be worth splitting some bits which genuinely
> are the same into common code.
If you want more details on this kind of issue, look at CP-Miner. A
paper published earlier this year in IEEE TSE[1] reports that that
tool found 421 cut-and-paste-related possible bugs in Linux, of which
49 were real bugs, 249 were false positives, and 123 could not be
proven either true or false positives.
[1]- http://doi.ieeecomputersociety.org/10.1109/TSE.2006.28
Michael Poole
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:54 ` [Ext2-devel] " Linus Torvalds
2006-06-09 17:04 ` Alex Tomas
@ 2006-06-09 17:44 ` Theodore Tso
2006-06-09 17:58 ` Jeff Garzik
2006-06-09 18:10 ` [Ext2-devel] " Andreas Dilger
2 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 17:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, Jun 09, 2006 at 09:54:49AM -0700, Linus Torvalds wrote:
>
>
> On Fri, 9 Jun 2006, Linus Torvalds wrote:
> >
> > Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> > take up way too much space in memory.
>
> Btw, I'm not kidding you on this one.
>
> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
To be fair, the bulk of the size of the inode is the
filesystem generic "struct inode", which is 480 bytes. Ext3 just
includes the struct inode as part of its core data structure, which
makes the whole thing *look* big. In fact, the ext3-specific part of
the in-core ext3 inode is only 188 bytes, for a total of 688 bytes for
the ext3_inode_info structure --- which is what you see in
/proc/slabinfo.
Other filesystems store the struct inode via a pointer to a separately
allocated chunk of memory, which makes their in-core inode footprint
*look* smaller but that's just an illusion if the only place you look
is /proc/slabinfo. For example, the xfs_inode_cache is 432 bytes, but
that's because struct inode is stored separately from xfs's
fake/pseudo "vfs inode" which it keeps around so the same code can be
used with Irix. (It always amazes me that we allow this for XFS,
when everywhere else we insist that that kind of cross-OS or
cross-version portability code is a fundamental violation of
CodingStyle, one which increases code bloat and makes the code harder
for Linux developers to read and maintain, but that's a rant for
another day.)
Now, obviously I won't say that we can't do work to trim down
ext3_inode_info. To be fair, reiserfs has an inode which is 576 bytes
long, so they only have 96 bytes of filesystem-specific information,
instead of the 188 bytes that we have in ext3. So we can look at
that, but remember that from the gross level, we're talking about 688
bytes per inode for ext3 compared to 576 bytes per inode for reiserfs
--- and at least 912 bytes per inode for xfs.
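The per-inode figures above can be cross-checked with a few lines of arithmetic. This sketch only restates the numbers quoted in this message; `unaccounted` is a hypothetical helper. Note that for ext3 the two quoted parts (480 + 188) come to 668, which is 20 bytes short of the quoted 688 total; the message does not say whether that gap is padding or a slip, so the sketch just reports it.

```python
VFS_INODE = 480  # bytes, generic struct inode as quoted above

# name: (filesystem-specific bytes, quoted total bytes per in-core inode)
QUOTED = {
    "ext3": (188, 688),
    "reiserfs": (96, 576),
    "xfs": (432, 912),  # xfs_inode_cache plus a separately stored struct inode
}


def unaccounted(fs_specific, total, vfs_inode=VFS_INODE):
    """Bytes of the quoted total not explained by the two quoted parts."""
    return total - (vfs_inode + fs_specific)


if __name__ == "__main__":
    for name, (fs_specific, total) in QUOTED.items():
        print(f"{name}: total {total}, "
              f"unaccounted {unaccounted(fs_specific, total)}")
```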
But I think you would agree that we would want to improve this number
"honestly", by trying to trim down actual memory structure use,
instead of just simply making the in-core data structure bushier so as
to hide the true size of the per-inode footprint from people looking
at /proc/slabinfo, right? :-)
And in any case, this is why we have to think very carefully before
forking the codebase between ext3 and "ext4". The work that we might
use to slim down ext4_inode_info would also have to be backported to
ext3_inode_info before ext3 users see the benefit. And there may also
be bugs that now have to be fixed in _three_ separate codebases ---
ext2, ext3, and ext4. To give another concrete example, adding
extents won't change the htree directory lookup code, so needlessly
having two copies of that htree code in the kernel would be a Bad
Idea(tm). We've already on occasion found bugs that we had fixed in
ext3, but had forgotten to backport to ext2, and vice versa. Adding a
third would triple our maintenance headache --- a similar reason why
we haven't started a 2.7 development tree yet, since we would have to
backport bug fixes back and forth between 2.6 and 2.7.
This is not to say that forking ext3 to make a copy of the code that we
call "ext4" is automatically a bad idea to be dismissed out of hand,
just as someday we might fork 2.6 and start a 2.7 development
branch. But in both cases we need to think very hard about the
tradeoffs before we just go ahead and do it.
Regards,
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:44 ` Theodore Tso
@ 2006-06-09 17:58 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:58 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Theodore Tso wrote:
> And in any case, this is why we have to think very carefully before
> forking the codebase between ext3 and "ext4". The work that we might
> use to slim down ext4_inode_info would also have to be backported to
> ext3_inode_info before ext3 users see the benefit. And there may also
No, the entire point is that you stop backporting all the junk, and just
leave ext3 as is. Let it sit, let it stabilize.
New development -- including inode slimming work -- can be best done in
ext4. With ext3, you are fighting all those old back-compat features
and associated code paths bloating up the in-core inode [code].
_Obviously_ there may be bugs found in three codebases, rather than two.
But over time those will trickle off, particularly when developers
successfully resist the urge to continue modifying ext[23].
There will always be newer, bigger storage situations and arrays, and I
think it's a mistake to continue modifying the same Linux filesystem to
support all these situations. The logical end result is a big, unwieldy
codebase that supports $N metadata, data, and journal formats.
In the same way we don't stuff support for all PCI ethernet or SATA
drivers into the same .o file, we shouldn't keep stuffing support for
all these varying filesystem formats into ext3.o. That creates (and
extents exacerbate) the "what ext3 fs am I mounting, today?" support
problem.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:54 ` [Ext2-devel] " Linus Torvalds
2006-06-09 17:04 ` Alex Tomas
2006-06-09 17:44 ` Theodore Tso
@ 2006-06-09 18:10 ` Andreas Dilger
2006-06-09 18:22 ` Linus Torvalds
` (2 more replies)
2 siblings, 3 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
cmm, linux-fsdevel
On Jun 09, 2006 09:25 -0700, Linus Torvalds wrote:
> So two separate filesystems are _less_ to maintain than one big one. Even
> if there's a lot of code that -could- be shared.
That is true if people are willing to maintain both trees. I think that
even with the current ext2/ext3 split there are continually fixes that are
missing from one filesystem or another.
> Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> take up way too much space in memory. It has absolutely disgusting code to
> handle directory reading and writing (buffer heads! In 2006!).
My point exactly! The ext2 directory code was moved from buffer heads to
page cache by Al after ext3 was forked and the code was never fixed in ext3.
I don't see this getting any better if there is an ext4 filesystem and all
of the ext3 developers are only interested in maintaining ext4. Look at
reiserfs - it is completely abandoned by Hans in favour of reiser4 (the
entry in MAINTAINERS notwithstanding) except for Chris Mason at SuSE.
Having a single codebase for everyone means that it is continually maintained
and users of ext3 aren't left out in the cold.
On Jun 09, 2006 09:54 -0700, Linus Torvalds wrote:
> Btw, I'm not kidding you on this one.
>
> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
Do you think that would be any different with a new filesystem?
> And you know what? 2TB files are totally uninteresting to 99.9999% of all
> people. Most people find it _much_ more interesting to have hundreds of
> thousands of _smaller_ files instead.
>
> So do this:
>
> cat /proc/slabinfo | grep ext3
# head -2 /proc/slabinfo
slabinfo - version: 2.1
name <active_objs> <num_objs> <objsize> <objperslab>
# grep ext2 /proc/slabinfo
ext2_inode_cache 0 0 572 7
ext2_xattr 0 0 48 81
# grep ext3 /proc/slabinfo
ext3_inode_cache 30207 41418 616 6
ext3_xattr 0 0 48 81
# grep xfs /proc/slabinfo
xfs_ili 2558 2576 140 28
xfs_inode 2558 2565 448 9
# grep jfs /proc/slabinfo
jfs_ip 0 0 1048 3
So, the ext3 inode could grow another ~50 bytes without changing the
slab allocation size ;-), and in fact other filesystems aren't noticeably
different.
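The headroom estimate can be reproduced from the objsize and objperslab columns. A rough sketch, assuming each slab is a single 4096-byte page and ignoring slab management overhead and alignment, so the result is an upper bound rather than an exact figure; `growth_headroom` is a hypothetical name.

```python
def growth_headroom(objsize, objperslab, slab_bytes=4096):
    """Upper bound on bytes an object can grow with objperslab unchanged."""
    return slab_bytes // objperslab - objsize


if __name__ == "__main__":
    # (objsize, objperslab) pairs taken from the slabinfo output above.
    caches = {
        "ext2_inode_cache": (572, 7),
        "ext3_inode_cache": (616, 6),
        "xfs_inode": (448, 9),
        "jfs_ip": (1048, 3),
    }
    for name, (objsize, objperslab) in caches.items():
        print(f"{name}: at most "
              f"{growth_headroom(objsize, objperslab)} bytes of headroom")
```

For ext3_inode_cache this bound is 66 bytes, which is in line with the "~50 bytes" estimate once per-slab bookkeeping is taken off.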
> and be absolutely disgusted and horrified by the size of those inodes
> already, and ask yourself whether extending the block size to 48 bits will
> help or further hurt one of the biggest problems of ext3 right now?
This is then the biggest problem of all filesystems.
> (And yes, I realize that block numbers are just a small part of it. The
> "vfs_inode" is also a real problem - it's got _way_ too many large
> list-heads that explode on a 64-bit kernel, for example. Oh, well.
On a 32-bit system the vfs_inode is more than half of the size of the ext3
inode, it is worse on 64-bit systems.
> My point is that things like this can make a very real issue _worse_ for all
> the people who don't care one whit about it)
The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
and I think I argued fairly strongly to also have a CONFIG_ flag to allow
larger than 2TB file support only for those users that want it.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:10 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-09 18:22 ` Linus Torvalds
2006-06-09 18:30 ` Alex Tomas
2006-06-09 18:40 ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:41 ` Jeff Garzik
2 siblings, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:22 UTC (permalink / raw)
To: Andreas Dilger
Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
cmm, linux-fsdevel
On Fri, 9 Jun 2006, Andreas Dilger wrote:
> missing from one filesystem or another.
>
> > Just as an example: ext3 _sucks_ in many ways. It has huge inodes that
> > take up way too much space in memory. It has absolutely disgusting code to
> > handle directory reading and writing (buffer heads! In 2006!).
>
> My point exactly! The ext2 directory code was moved from buffer heads to
> page cache by Al after ext3 was forked and the code was never fixed in ext3.
The code was never fixed in ext3, because ext3 is a pig in that area.
You misunderstand how this worked.
The reason ext2 got fixed was that ext2 was _simple_. It got fixed
_despite_ the fact that it's not all that widely used any more, and not
considered a really important filesystem. It got fixed because it wasn't
too bad. It doesn't have all the crud that makes it a much more involved
thing to do for ext3.
So if the ext2/3 split hadn't happened, _neither_ of them would be fixed.
See?
My point is, maintaining two different pieces is SIMPLER.
Even if that simplicity sometimes ends up meaning "not maintaining the
other one".
So being out of sync is not a problem. It's a _feature_.
> On Jun 09, 2006 09:54 -0700, Linus Torvalds wrote:
> > Btw, I'm not kidding you on this one.
> >
> > THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
>
> Do you think that would be any different with a new filesystem?
It would be bigger, if you made ext3 do 48-bit block numbers.
See? ext3 would become strictly _worse_ for the majority of users, who
wouldn't get any advantage. That's my point.
> So, the ext3 inode could grow another ~50 bytes without changing the
> slab allocation size ;-), and in fact other filesystems aren't noticeably
> different.
Yes, I already pointed out that the biggest part of it was actually the
vfs_inode thing.
And btw, growing more than 50 bytes is exactly what it would do. Go look.
> This is then the biggest problem of all filesystems.
Yeah, under many loads it is. We do really badly with lots of metadata in
memory. Why do you think people have historically complained about things
like the updatedb flushing their disk cache?
If you look at disk access patterns, one of _the_ biggest problems is not
in reading individual files. It's in inode atime updates and the other
"stupid crap" stuff.
> On a 32-bit system the vfs_inode is more than half of the size of the ext3
> inode, it is worse on 64-bit systems.
..which I pointed out, and doesn't change my point one _whit_.
The fact that the block numbers aren't the _only_ problem doesn't suddenly
mean they are problem-free, does it?
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:22 ` Linus Torvalds
@ 2006-06-09 18:30 ` Alex Tomas
2006-06-09 18:38 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 18:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Linus Torvalds (LT) writes:
LT> My point is, maintaining two different pieces is SIMPLER.
"different" is a key word here. why should we copy most of ext3 code
into ext4?
LT> It would be bigger, if you made ext3 do 48-bit block numbers.
nope, we re-use existing i_data w/o any changes. yes, we've made
inode a bit larger to cache last found extent. this noticeably improves
performance in some workloads, though.
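To make the caching point concrete, here is a minimal userspace sketch of a "last found extent" cache. It is not the code from the patch set: the struct layouts, the demo_* names and the linear search are all assumptions made for illustration, and the extent record shown is not the on-disk format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: maps `len` logical blocks starting at `lblk`
 * to consecutive physical blocks starting at `pblk`. */
struct demo_extent {
    uint32_t lblk;
    uint32_t len;
    uint64_t pblk;
};

/* Hypothetical in-memory inode state. The `cache` member is the extra
 * space the in-memory inode pays for the faster lookup. */
struct demo_inode {
    const struct demo_extent *extents; /* all extents of the file */
    size_t nr_extents;
    struct demo_extent cache;          /* last extent found; len == 0 means empty */
    unsigned long slow_lookups;        /* counts full searches, for illustration */
};

static int demo_extent_covers(const struct demo_extent *ex, uint32_t lblk)
{
    return ex->len != 0 && lblk >= ex->lblk && lblk - ex->lblk < ex->len;
}

/* Returns the physical block backing `lblk`, or 0 if it is a hole. */
uint64_t demo_map_block(struct demo_inode *inode, uint32_t lblk)
{
    size_t i;

    /* Fast path: sequential access usually hits the cached extent. */
    if (demo_extent_covers(&inode->cache, lblk))
        return inode->cache.pblk + (lblk - inode->cache.lblk);

    /* Slow path: search all extents, then remember the one that matched. */
    inode->slow_lookups++;
    for (i = 0; i < inode->nr_extents; i++) {
        if (demo_extent_covers(&inode->extents[i], lblk)) {
            inode->cache = inode->extents[i];
            return inode->cache.pblk + (lblk - inode->cache.lblk);
        }
    }
    return 0;
}
```

The trade-off under discussion is visible in the struct: every in-memory inode carries the cached extent whether or not the workload benefits from it.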
LT> See? ext3 would become strictly _worse_ for the majority of users, who
LT> wouldn't get any advantage. That's my point.
would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
thanks. Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:30 ` Alex Tomas
@ 2006-06-09 18:38 ` Linus Torvalds
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
` (2 more replies)
2006-06-09 18:43 ` [Ext2-devel] " Jeff Garzik
2006-06-09 18:50 ` Diego Calleja
2 siblings, 3 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 18:38 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Alex Tomas wrote:
>
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
Let's put it this way:
- have you had _any_ valid argument at all against "ext4"?
Think about it. Honestly. Tell me anything that doesn't work?
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:38 ` Linus Torvalds
@ 2006-06-09 18:50 ` Chase Venters
2006-06-09 19:00 ` Chase Venters
` (2 more replies)
2006-06-09 19:22 ` Alex Tomas
2006-06-09 20:16 ` Andreas Dilger
2 siblings, 3 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 18:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alex Tomas, Andreas Dilger, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>
> On Fri, 9 Jun 2006, Alex Tomas wrote:
>>
>> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
>
> Let's put it this way:
> - have you had _any_ valid argument at all against "ext4"?
>
> Think about it. Honestly. Tell me anything that doesn't work?
It's about bundling. It's about being able to take your 3-year old
dependable car and make it faster by bolting on new manifolds and
turbochargers, rather than waiting a year for the manufacturer to release
a totally new model (and buying totally new cars often means you're part
of the manufacturer's debugging group, so be prepared to have things fail
which require warranty work).
Now, granted, I really do agree with you about the whole code sharing
thing. A fresh start is often just what you need. I'm just questioning if
it wouldn't be better to do this fresh start immediately after going
48-bit, rather than before. That way, existing users that want that extra
umph can have it today.
> Linus
Cheers,
Chase
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:00 ` Chase Venters
2006-06-10 13:33 ` Adrian Bunk
2006-06-09 19:01 ` Jeff Garzik
2006-06-09 19:21 ` Alan Cox
2 siblings, 1 reply; 295+ messages in thread
From: Chase Venters @ 2006-06-09 19:00 UTC (permalink / raw)
To: Chase Venters
Cc: Linus Torvalds, Alex Tomas, Andreas Dilger, Jeff Garzik,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, 9 Jun 2006, Chase Venters wrote:
> On Fri, 9 Jun 2006, Linus Torvalds wrote:
>
>>
>>
>> On Fri, 9 Jun 2006, Alex Tomas wrote:
>> >
>> > would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
>>
>> Let's put it this way:
>> - have you had _any_ valid argument at all against "ext4"?
>>
>> Think about it. Honestly. Tell me anything that doesn't work?
>
> Now, granted, I really do agree with you about the whole code sharing thing.
> A fresh start is often just what you need. I'm just questioning if it
> wouldn't be better to do this fresh start immediately after going 48-bit,
> rather than before. That way, existing users that want that extra umph can
> have it today.
>
Let me clarify that I don't have a final answer or opinion for whether or
not 48-bit belongs in ext3 or ext4. But I'm trying to illustrate that it's an
important question to raise.
In Group A we have some number of users that must have 48-bit support by
Date B. 48-bit support could be available in ext3 by Date A, before Date
B. It could also be available in ext4 by Date X, along with a handful of
other features.
Is Date X before Date B? If it's not, is it worth telling Group A to
suffer for a while, or asking them to use ext4 before it's ready? These
are the questions I'd have to know the answers to if I were the one
casting a final decision.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:00 ` Chase Venters
@ 2006-06-10 13:33 ` Adrian Bunk
0 siblings, 0 replies; 295+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:33 UTC (permalink / raw)
To: Chase Venters
Cc: Linus Torvalds, Alex Tomas, Andreas Dilger, Jeff Garzik,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, Jun 09, 2006 at 02:00:15PM -0500, Chase Venters wrote:
> On Fri, 9 Jun 2006, Chase Venters wrote:
>
> >On Fri, 9 Jun 2006, Linus Torvalds wrote:
> >
> >>
> >>
> >> On Fri, 9 Jun 2006, Alex Tomas wrote:
> >>>
> >>> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
> >>
> >> Let's put it this way:
> >> - have you had _any_ valid argument at all against "ext4"?
> >>
> >> Think about it. Honestly. Tell me anything that doesn't work?
> >
> >Now, granted, I really do agree with you about the whole code sharing
> >thing. A fresh start is often just what you need. I'm just questioning if
> >it wouldn't be better to do this fresh start immediately after going
> >48-bit, rather than before. That way, existing users that want that extra
> >umph can have it today.
> >
>
> Let me clarify that I don't have a final answer or opinion for whether or
> not 48-bit belongs in ext3 or ext4. But I'm trying to illustrate that it's
> an important question to raise.
>
> In Group A we have some number of users that must have 48-bit support by
> Date B. 48-bit support could be available in ext3 by Date A, before Date
> B. It could also be available in ext4 by Date X, along with a handful of
> other features.
>
> Is Date X before Date B? If it's not, is it worth telling Group A to
> suffer for a while, or asking them to use ext4 before it's ready? These
> are the questions I'd have to know the answers to if I were the one
> casting a final decision.
There are many points mentioned in this discussion like:
- possibility of regressions for existing users
- time until the new code is actually stable and well-tested
- long-term maintainability
The faster availability is a point, but it's only one amongst many
points.
And it's not that we are talking about a feature not yet available in
Linux at all. Instead of suffering, couldn't the few people in urgent
need of 48-bit support use JFS or XFS?
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
2006-06-09 19:00 ` Chase Venters
@ 2006-06-09 19:01 ` Jeff Garzik
2006-06-10 19:27 ` Kyle Moffett
2006-06-09 19:21 ` Alan Cox
2 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:01 UTC (permalink / raw)
To: Chase Venters
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
Chase Venters wrote:
> Now, granted, I really do agree with you about the whole code sharing
> thing. A fresh start is often just what you need. I'm just questioning
> if it wouldn't be better to do this fresh start immediately after going
> 48-bit, rather than before. That way, existing users that want that
> extra umph can have it today.
Then you continue to crap up the code with
if (48bit)
...
else
...
etc.
The proper way to do this is "cp -a ext3 ext4" (excluding JBD as Andrew
mentioned), and then let evolution take its course.
"Evolution" means the standard Linux development -- patch the kernel,
patch e4fsprogs, test, lather rinse repeat. The best development
platform for new features is one that _works_, and keeps working.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:01 ` Jeff Garzik
@ 2006-06-10 19:27 ` Kyle Moffett
2006-06-10 19:44 ` Linus Torvalds
0 siblings, 1 reply; 295+ messages in thread
From: Kyle Moffett @ 2006-06-10 19:27 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
On Jun 9, 2006, at 15:01:20, Jeff Garzik wrote:
> Chase Venters wrote:
>> Now, granted, I really do agree with you about the whole code
>> sharing thing. A fresh start is often just what you need. I'm just
>> questioning if it wouldn't be better to do this fresh start
>> immediately after going 48-bit, rather than before. That way,
>> existing users that want that extra umph can have it today.
>
> Then you continue to crap up the code with
>
> if (48bit)
> ...
> else
> ...
>
> etc.
>
> The proper way to do this is "cp -a ext3 ext4" (excluding JBD as
> Andrew mentioned), and then let evolution take its course.
Why not: "extX_ops.do_something_useful();", then have fs/ext/ext
{2,3,4}_ops.c which implement those various operations just like we
do for the Virtual Filesystem Switch? Much as there are
commonalities between all filesystems that get moved into the VFS;
perhaps we should have a Virtual Ext Filesystem Switch (VEFS?
VextFS?) which abstracts out the commonalities between the evolving
ext{2,3} code and data format? Such code would also provide a
library of common routines which could be used to implement other
specialized filesystems in the future. Imagine a cluster-extfs which
reuses some of the core extXfs code despite changing the on-disk
format considerably!
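A rough sketch of what such an ops-table split could look like, next to the "if (48bit)" style quoted above. This is only an illustration under assumptions: struct extx_ops, the demo32/demo48 tables and extx_check_range() are hypothetical names, not existing kernel interfaces, and the 32 vs 48 bit widths are taken from the subject of this thread.

```c
#include <stdint.h>

/* Hypothetical per-format method table, in the spirit of the VFS
 * operation structs. None of these names exist in the kernel. */
struct extx_ops {
    const char *name;
    int (*blocknr_valid)(uint64_t blk); /* can this format address blk? */
};

static int demo32_blocknr_valid(uint64_t blk)
{
    return blk < ((uint64_t)1 << 32);
}

static int demo48_blocknr_valid(uint64_t blk)
{
    return blk < ((uint64_t)1 << 48);
}

const struct extx_ops demo32_ops = { "demo32", demo32_blocknr_valid };
const struct extx_ops demo48_ops = { "demo48", demo48_blocknr_valid };

/* Shared code: no "if (48bit)" here, the format-specific part is
 * reached only through the table. Returns 1 if the whole range
 * [start, start + count) is addressable, 0 otherwise. */
int extx_check_range(const struct extx_ops *ops, uint64_t start, uint64_t count)
{
    uint64_t last;

    if (count == 0)
        return 1;
    last = start + (count - 1);
    if (last < start)           /* 64-bit wraparound */
        return 0;
    return ops->blocknr_valid(last);
}
```

The shared function stays free of format checks, but every difference between the formats has to be turned into a method first, and that refactoring work is the real cost of this approach.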
Cheers,
Kyle Moffett
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 19:27 ` Kyle Moffett
@ 2006-06-10 19:44 ` Linus Torvalds
2006-06-10 20:02 ` [Ext2-devel] " Linus Torvalds
0 siblings, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-10 19:44 UTC (permalink / raw)
To: Kyle Moffett
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
On Sat, 10 Jun 2006, Kyle Moffett wrote:
>
> Why not: "extX_ops.do_something_useful();", then have fs/ext/ext{2,3,4}_ops.c
I think that kind of setup is hugely preferable to conditionals in the
code, if only because it tends to force people to do the abstractions
right, and make the code sequences independent.
I just don't think it's necessarily very realistic - it's _hard_ to
refactor code well. It also doesn't buy you hardly anything at all, since
the people who are interested in ext2 are usually not very interested in
sharing code with ext3. The filesystems simply aren't that similar, apart
from the layout.
ext2 is half the size of ext3, and that's ignoring JBD entirely.
That constant growth, btw, is one reason why splitting off legacy
filesystems is often a good idea. What do you want to bet that the 2000+
line difference RIGHT NOW in ext3/ext4 will grow in the future? Splitting
things off means that people who don't care about the new features can
just stay with a stable base and also avoid the bloat. Exactly the way you
can stay with ext2 on an old machine, and avoid the bloat of ext3.
There's also nothing that says that legacy filesystems cannot be
simplified. For example, it's perfectly realistic to say that ext3 (as a
legacy filesystem) doesn't support resizing, and simply ripping that part
out of it. The people who don't want the bloat will be happy. The people
who want the feature can move to ext4.
See? Splitting development is what allows you to make choices that you
simply otherwise don't _have_.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 19:44 ` Linus Torvalds
@ 2006-06-10 20:02 ` Linus Torvalds
2006-06-10 21:26 ` Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-10 20:02 UTC (permalink / raw)
To: Kyle Moffett
Cc: Jeff Garzik, Chase Venters, Alex Tomas, Andreas Dilger,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On Sat, 10 Jun 2006, Linus Torvalds wrote:
>
> ext2 is half the size of ext3, and that's ignoring JBD entirely.
Btw, let me say again that I'm fairly neutral on any particular individual
feature (ie the 48-bit thing doesn't actually move me all that much in
itself), but that from a maintenance standpoint, I think splitting off
filesystems and drivers has been a _huge_ success.
Starting from scratch - even if you literally start from the same
code-base - and allowing the old functionality to remain undisturbed is
just a very nice model. Yeah, yeah, it has some diskspace cost (although
at least from a git perspective, even that isn't really true), but we've
seen both in drivers and in filesystems how splitting things up has been a
great thing to do.
Sometimes it's a great thing just because five years later, it turns out
that nobody even uses the legacy thing, and you decide to at that point
just remove the driver (or filesystem, but so far it's never been the
case for filesystems even if smbfs is a potential victim of this in the
not _too_ distant future), because the new version simply does everything
better.
And that's _not_ a failure of the model. It's a success too. But so is the
above commentary on ext2, when the "old driver/filesystem is still used
and maintained by odd people". It's just two different possible outcomes
of the decision to do development separately from an older user base.
And again, I'd like to stress the _user_base_ over the _code_base_. In
many ways, that's the much more important split. I suspect Jeff has seen
this in drivers, where a lot of users simply do not want to have a new
driver, because it does some huge fundamental improvement for new users
but doesn't work for old ethernet cards, for example, because it missed
some old use case that depended on a legacy feature that just doesn't fit well
into the new (and obviously improved) world-view.
So we've often seen a driver that _could_ have handled different versions
of the same card/chip split into an "old" and a "new" driver, and on the
whole it has always been positive - even if eventually the old driver just
becomes irrelevant for one reason or another.
Duplication isn't actually bad. It's what often allows experimentation,
and streamlining. In drivers, for example, duplication is _often_ done as
part of simply dropping support for old cards in the new version, but also
by dropping and simplifying the old driver that now has a much clearer
"raison d'etre", aka "user base".
Which gets me back to the whole "'user base' matters more than 'code
base'" argument, because it's literally the user base that determines
development (or lack of it - non-development is often the big reason for a
user base, as anybody who works for a distribution maintainer should know
intimately).
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 20:02 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-10 21:26 ` Theodore Tso
2006-06-10 21:31 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 21:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
Andreas Dilger
On Sat, Jun 10, 2006 at 01:02:26PM -0700, Linus Torvalds wrote:
> Starting from scratch - even if you literally start from the same
> code-base - and allowing the old functionality to remain undisturbed is
> just a very nice model. Yeah, yeah, it has some diskspace cost (although
> at least from a git perspective, even that isn't really true), but we've
> seen both in drivers and in filesystems how splitting things up has been a
> great thing to do.
>
> Sometimes it's a great thing just because five years later, it turns out
> that nobody even uses the legacy thing, and you decide to at that point
> just remove the driver (or filesystem, but so far it's never been the
> case for filesystems even if smbfs is a potential victim of this in the
> not _too_ distant future), because the new version simply does everything
> better.
So you would be OK with a model where we copy fs/ext3 to
"fs/ext4", and do development there which would be merged rapidly into
mainline so that people who want to participate in testing can use
ext3dev, while people who want stability can use ext3 --- and at some
point, we remove the old ext3 entirely and let fs/ext4 register itself
as both the ext3 and ext4 filesystem, and at some point in the future,
remove the ext3 name entirely?
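For anyone wondering what "register itself as both" would amount to, here is a toy userspace model of a name-to-driver table. It is only a sketch under assumptions: demo_register_fs() and demo_lookup_fs() are hypothetical helpers, not the kernel's registration interface, and the driver strings are just labels.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FS 8

/* Hypothetical registry entry: a filesystem type name as seen by
 * users, and a label for the code that implements it. */
struct demo_fs_type {
    const char *name;
    const char *driver;
};

static struct demo_fs_type demo_registry[DEMO_MAX_FS];
static size_t demo_nr_fs;

/* Returns 0 on success, -1 if the name is taken or the table is full. */
int demo_register_fs(const char *name, const char *driver)
{
    size_t i;

    for (i = 0; i < demo_nr_fs; i++)
        if (strcmp(demo_registry[i].name, name) == 0)
            return -1;
    if (demo_nr_fs == DEMO_MAX_FS)
        return -1;
    demo_registry[demo_nr_fs].name = name;
    demo_registry[demo_nr_fs].driver = driver;
    demo_nr_fs++;
    return 0;
}

/* Returns the driver label for `name`, or NULL if it is not registered. */
const char *demo_lookup_fs(const char *name)
{
    size_t i;

    for (i = 0; i < demo_nr_fs; i++)
        if (strcmp(demo_registry[i].name, name) == 0)
            return demo_registry[i].driver;
    return NULL;
}
```

In this model the end state described above is two names pointing at one implementation, so existing "ext3" users keep working after the old code is removed, and dropping the ext3 name later is a separate step.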
If that allows us to make forward progress and stop the
flamewar, I'm willing to go along with it --- although e2fsprogs will
continue to support ext2/3/4, and ext4 will have backwards
compatibility support for ext3 formats (we can look at better ways of
refactoring code to make it cleaner, if people don't like the current
conditions). There are some real advantages to the system, especially
if we can get changes merged into mainline for ext4 more quickly while
it is under development and declared to be unstable (we can put it
under CONFIG_EXPERIMENTAL if people really want).
As far as people who want to use ext3 as the beginning point
to do something that has no forwards compatibility, there's
nothing stopping them from creating a jgarzikfs if they want. But I
think I can speak for most of the ext3 development community that we
feel that one of the strengths of ext2/3 is its ability to do smooth
upgrades (and in many cases, downgrades as well, when people need to
migrate a filesystem so it can be mounted on older kernels), and that
it's one of the reasons why ext3 has been more successful than, say,
JFS.
I do think there is plenty of room for competition, and I'm
certainly looking forward to the brainstorming at next week's
filesystem workshop. But ext2/3 has been pretty successful for over
ten years given a certain development model and philosophy, and I for
one am interested in how much farther we can take it. Remember when
academics were saying that Linux was an obsolete design and
microkernels were where it's at? If we had given up 15 years ago when
Prof. Tanenbaum had said it, where would we be?
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 21:26 ` Theodore Tso
@ 2006-06-10 21:31 ` Linus Torvalds
2006-06-10 22:12 ` Jeff Garzik
2006-06-10 22:21 ` Jeff Garzik
2 siblings, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-10 21:31 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
Andreas Dilger
On Sat, 10 Jun 2006, Theodore Tso wrote:
>
> So you would be OK with a model where we copy fs/ext3 to
> "fs/ext4", and do development there which would be merged rapidly into
> mainline so that people who want to participate in testing can use
> ext3dev, while people who want stability can use ext3
Absolutely.
> --- and at some
> point, we remove the old ext3 entirely and let fs/ext4 register itself
> as both the ext3 and ext4 filesystem, and at some point in the future,
> remove the ext3 name entirely?
Maybe, and maybe not. That depends on where ext4 is when the thing calms
down.
Look at what happened to ext2. Would you seriously suggest removing it
just because ext3 does more than ext2 does?
And yes, if I recall correctly, all the same ext2 people were against the
whole ext2->ext3 split also, which we did for the same reason - I and
others refused to let people "hack on" the standard stable filesystem.
Yet I don't see anybody in this discussion saying "I admit I was wrong
back then - the split was correct". Hmm. I wonder where those people went?
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 21:26 ` Theodore Tso
2006-06-10 21:31 ` Linus Torvalds
@ 2006-06-10 22:12 ` Jeff Garzik
2006-06-10 22:21 ` Jeff Garzik
2 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 22:12 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
Linus Torvalds, cmm, linux-fsdevel, Kyle Moffett, Alex Tomas,
Andreas Dilger
Theodore Tso wrote:
> As far as people who want to use ext3 as the beginning point
> to do something that has no forwards compatibility, there's
> nothing stopping them from creating a jgarzikfs if they want. But I
> think I can speak for most of the ext3 development community that we
> feel that one of the strengths of ext2/3 is its ability to do smooth
> upgrades (and in many cases, downgrades as well, when people need to
> migrate a filesystem so it can be mounted on older kernels), and that
> it's one of the reasons why ext3 has been more successful than, say,
> JFS.
When did I ever say smooth upgrades were a bad idea?
The whole point of 'cp -a ext3 ext4' is to ensure smooth upgrades
continue. A key theme is to avoid -backporting- all this new stuff
that's going into ext4. IMO ext3 shouldn't be a devel platform at this
point in its lifecycle.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 21:26 ` Theodore Tso
2006-06-10 21:31 ` Linus Torvalds
2006-06-10 22:12 ` Jeff Garzik
@ 2006-06-10 22:21 ` Jeff Garzik
2006-06-11 4:39 ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
2 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 22:21 UTC (permalink / raw)
To: Theodore Tso, Linus Torvalds, Kyle Moffett, Jeff Garzik,
Chase Venters, Alex Tomas, Andreas Dilger, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
Theodore Tso wrote:
> So you would be OK with a model where we copy fs/ext3 to
> "fs/ext4", and do development there which would be merged rapidly into
> mainline so that people who want to participate in testing can use
> ext3dev, while people who want stability can use ext3 --- and at some
> point, we remove the old ext3 entirely and let fs/ext4 register itself
> as both the ext3 and ext4 filesystem, and at some point in the future,
> remove the ext3 name entirely?
Yep, and in addition I would argue that you can take the opportunity to
make ext4 default to extents-enabled, and some similar behavior changes
(dir_index default?). The existence of both ext3 and ext4 means you can
be more aggressive in turning on stuff, IMO.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Stable/devel policy - was Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 22:21 ` Jeff Garzik
@ 2006-06-11 4:39 ` Neil Brown
2006-06-11 5:19 ` Stable/devel policy - was " Linus Torvalds
2006-06-13 0:28 ` Stable/devel policy - was Re: [Ext2-devel] " Mingming Cao
0 siblings, 2 replies; 295+ messages in thread
From: Neil Brown @ 2006-06-11 4:39 UTC (permalink / raw)
To: Jeff Garzik
Cc: Theodore Tso, Linus Torvalds, Kyle Moffett, Chase Venters,
Alex Tomas, Andreas Dilger, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel
On Saturday June 10, jeff@garzik.org wrote:
> Theodore Tso wrote:
> > So you would be OK with a model where we copy fs/ext3 to
> > "fs/ext4", and do development there which would be merged rapidly into
> > mainline so that people who want to participate in testing can use
> > ext3dev, while people who want stability can use ext3 --- and at some
> > point, we remove the old ext3 entirely and let fs/ext4 register itself
> > as both the ext3 and ext4 filesystem, and at some point in the future,
> > remove the ext3 name entirely?
>
> Yep, and in addition I would argue that you can take the opportunity to
> make ext4 default to extents-enabled, and some similar behavior changes
> (dir_index default?). The existence of both ext3 and ext4 means you can
> be more aggressive in turning on stuff, IMO.
>
> Jeff
I'm wondering what all this has to say about general principles of
sub-project development with the Linux kernel.
There is a strong tradition of software projects having a 'stable'
branch and a 'development' branch, and having both available and both
receiving bug fixes (at least) so that users can choose what best
suits their needs.
Due to the (quite appropriate) lack of a stable API for kernel
modules, it isn't really practical (and definitely isn't encouraged)
to distribute kernel-modules separately. This seems to suggest that
if we want a 'stable' and a 'devel' branch of a project, both branches
need to be distributed as part of the same kernel tree.
Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
evidence of this happening. Why is that?
- is -mm enough? It seems to be enough for small updates, but
doesn't seem to be enough for more major projects. How long
have the ext3 patches been in -mm?? (I cannot actually seem
to find them there at all)
- is there lots of -devel code slipping in to the 'stable' tree, thus
resulting in a kernel.org tree that is permanently unstable (in
which case there should be no objection to the new ext3 code -
leave it to distros to keep it out until it is stable).
- are we just not innovating as much as we could be and so don't
need a -devel? Is ext3 the only site of major innovation?
Seems unlikely.
It seems a bit rough to insist that the ext-fs fork every so-often,
but not impose similar requirements on other sections of code.
So: what would you (collectively) suggest should be the policy for
managing substantial innovation within Linux subsystems? And how
broadly should it be applied?
NeilBrown
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Stable/devel policy - was Re: [RFC 0/13] extents and 48bit ext3
2006-06-11 4:39 ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
@ 2006-06-11 5:19 ` Linus Torvalds
2006-06-11 7:32 ` Ingo Molnar
2006-06-13 0:28 ` Stable/devel policy - was Re: [Ext2-devel] " Mingming Cao
1 sibling, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-11 5:19 UTC (permalink / raw)
To: Neil Brown
Cc: Andrew Morton, Theodore Tso, Jeff Garzik, ext2-devel,
linux-kernel, Chase Venters, cmm, linux-fsdevel, Kyle Moffett,
Alex Tomas, Andreas Dilger
On Sun, 11 Jun 2006, Neil Brown wrote:
>
> I'm wondering what all this has to say about general principles of
> sub-project development with the Linux kernel.
Yes. That's an interesting and relevant tangent.
> Due to the (quite appropriate) lack of a stable API for kernel
> modules, it isn't really practical (and definitely isn't encouraged)
> to distribute kernel-modules separately. This seems to suggest that
> if we want a 'stable' and a 'devel' branch of a project, both branches
> need to be distributed as part of the same kernel tree.
>
> Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
> evidence of this happening. Why is that?
I think part of it is "expense". It's pretty expensive to maintain on a
bigger scale. For example, you mention "-mm", and there's no question that
it's _very_ expensive to do that (ie you basically need a very respected
person who must be spending a fair amount of effort and time on it).
Even in this case, I think a large argument has been that ext3 itself
isn't getting a lot of active development outside of the suggested ext4
effort, so the "expense" there is literally just the copying of the files.
That works ok for a filesystem every once in a while, but it wouldn't
scale to _everybody_ doing it often.
Also, in order for it to work at all, it obviously needs to be a part of
the kernel that -can- be duplicated. That pretty much means "filesystem"
or "device driver". Other parts aren't as amenable to having multiple
concurrent versions going on at the same time (although it clearly does
happen: look at the IO schedulers, where a large reason for the pluggable
IO scheduler model was to allow multiple independent schedulers exactly so
that people _could_ do different ones in parallel).
People have obviously suggested pluggable CPU schedulers too, and even
more radically pluggable VM modules (not that long ago).
> It seems a bit rough to insist that the ext-fs fork every so-often,
> but not impose similar requirements on other sections of code.
Well, as mentioned, it's actually quite common in drivers. It's clearly
not the _main_ development model, but it's happened several times in
almost every single driver subsystem (ie SCSI drivers, video drivers,
network drivers, USB, IDE, have _all_ seen "duplicated" drivers where
somebody just decided to do things differently, and rather than extend an
existing driver, do an alternate one).
So it's not like this is _exceptional_. It happens all the time. It
obviously happens less than normal development (we couldn't fork things
every time something changes), but it's not unheard of, or even rare.
> So: what would you (collectively) suggest should be the policy for
> managing substantial innovation within Linux subsystems? And how
> broadly should it be applied?
I think the interesting point is how we're moving away from the "global
development" model (ie everything breaks at the same time between 2.4.x
and 2.6.x), and how the fact that we're trying to maintain a more stable
situation may well mean that we'll see more of the "local development"
model where a specific subsystem goes through a development series, but
where stability requirements mean that we must not allow it to disturb
existing users.
And even more interestingly (at least to me), the question might become
one of "how does that affect the tools and build and configuration
infrastructure", and just the general flow of development.
I don't think one or two filesystems (and a few drivers) splitting is
anything new, but if this ends up becoming _more_ common, maybe that
implies a new model entirely..
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Stable/devel policy - was Re: [RFC 0/13] extents and 48bit ext3
2006-06-11 5:19 ` Stable/devel policy - was " Linus Torvalds
@ 2006-06-11 7:32 ` Ingo Molnar
0 siblings, 0 replies; 295+ messages in thread
From: Ingo Molnar @ 2006-06-11 7:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Theodore Tso, Jeff Garzik, Neil Brown, ext2-devel,
linux-kernel, Chase Venters, cmm, linux-fsdevel, Kyle Moffett,
Alex Tomas, Andreas Dilger
* Linus Torvalds <torvalds@osdl.org> wrote:
> And even more interestingly (at least to me), the question might
> become one of "how does that affect the tools and build and
> configuration infrastructure", and just the general flow of
> development.
>
> I don't think one or two filesystems (and a few drivers) splitting is
> anything new, but if this ends up becoming _more_ common, maybe that
> implies a new model entirely..
at least for core kernel stuff, it's hard to split things in any
manageable way (as you mentioned it as well) - so higher flux is
inevitable.
So what i've been focusing on more in the past year or so is to enable
the core kernel to take more development flux, via kernel features.
Instead of adding more features to the kernel, i'm quite interested in
seeing more technologies that make a higher development flux safer: to
make the kernel more debuggable, to make bugs more reportable for users,
to make the effects of bugs less harmful, and to make the kernel itself
notice more bugs by itself.
To be able to handle a higher development flux in core code, i think we
need the following policies wrt. core kernel changes:
- More code consolidation between architectures and subsystems.
Core kernel changes impact "non-mainstream" architectures the most -
while some of our best technologies stem from non-mainstream
technologies. So it's a net loss to only concentrate on the
mainstream, because developer and technology distribution does not
follow user distribution.
The generic irq subsystem, spinlock and semaphore/mutex consolidation
are all efforts in this direction. I consider the Generic Time Of Day
(GTOD) effort a similarly important item, for the same reasons. There
are other good examples too, for example klibc is a good step towards
a more consolidated boot process. The Xen subarch work triggers
consolidation too - etc. Andrew's policy of "you must not break _any_
architecture in -mm" is very important too.
And we should do consolidation even in cases where there's some
minimal runtime cost. Being able to handle higher flux is more
important than getting the last cycle out of the system. This does
not mean we should reject patches that do get those last cycles, this
only means we should not reject consolidation patches on the grounds
that they _lose_ a few cycles. I don't think this is a common problem
for consolidation projects right now - but it could happen in the
future.
- Even more cleanups.
We always preferred cleanups but it now becomes critical: i strongly
believe that cleanups must take precedence over feature work. [with a
few rare and temporary exceptions perhaps, like hardware-enablement
or really critical features.] It's much easier to spot bugs in clean
code, plus it's much easier for automated correctness validators to
find bugs in clean code.
(My own examples here include spinlock-init cleanups, which directly
enabled things like the lock validator. But pure code cleanups apply
too. )
- More automated correctness-checking tools and kernel features.
While the preferred mode of avoiding bugs should be a clean
design and clean code, higher flux introduces higher noise and bugs
are inevitable. So the importance of automated tools (both static and
dynamic analysis) has increased.
Sparse annotations are one good example. My own examples here are the
lock validator, the mutex debugging code, the consolidated
spinlock debugging code. Some of these are direct feature-enablers:
for example the smp_processor_id() debugging code directly enabled a
safe and painless migration to PREEMPT_BKL. One nice feature in the
works that can find hard-to-spot bugs is kmemleak.
- Coding style police!
With higher development flux it is becoming even more important for
kernel developers to review other developers' work. But that is very
hard if the coding style varies too much. This is a fundamentally
human problem, and the only sane solution is brutal: the _strict_
Linus coding style must be used in all high-flux subsystems.
- More debuggability, reportability.
In this area we still suck quite a bit, and this affects userspace
too: currently we have nothing equivalent to things like Dr Watson,
in Linux most of the info about the first userspace crash almost
always gets lost! (and even afterwards, once debug packages are
downloaded and the app is run in gdb, it's still too painful for the
user, so we lose lots of feedback.)
Some of the GUIs try to do something about this and automate crash
reporting, but it doesn't cover most of the app crashes and userspace
clearly needs kernel help, because ptrace is too inflexible for this
purpose. (help is on the way though, there's a next-gen ptrace
project that solves these problems very cleanly.)
There are a number of important projects going on in this area - for
example the dwarf unwinder for x86_64 to improve the quality of
kernel oopses, and kgdb (or bits of NLKD) if it gets clean enough.
my own impression is that things are going in the right direction, but
that there should be more awareness of these principles. I think if we
add a couple more key technologies then we can take the higher kernel
development flux just fine, without compromising quality. Even though
Linux has lots of developers, we should be more economical with that
development power and should waste less of that on unnecessarily complex
debugging tasks.
I do consider the forking of a subsystem the "easy way out" - the hard
and more correct approach is i think to turn every drastic rewrite into
small manageable steps. That's much easier said than done, and it's
sometimes 10 times the work but it's a lot safer - and the end result is
often wildly different (and a lot cleaner!) from what one would do via a
drastic rewrite. A dumb 'cp -a' copying of a subsystem will preserve
most of the legacies and architectural inefficiencies. Even an
intelligent drastic rewrite preserves most of the legacies - there's
only so much change users can take at once, and _eventually_ a new
subsystem has to be exposed to real users - at which point the
compatibility constraints apply again. I have yet to see a single case
of hard physical necessity to throw away an old subsystem due to
legacies. I think the prime example to follow is how Al Viro works -
he's been maintaining the VFS for many years without having to
duplicate functionality, without breaking the world, but he still
managed to turn the VFS upside down, inside out, in small, manageable
steps. It _is_ possible in almost every case, for all but the most
spaghetti pieces of code.
Ingo
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Stable/devel policy - was Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-11 4:39 ` Stable/devel policy - was Re: [Ext2-devel] " Neil Brown
2006-06-11 5:19 ` Stable/devel policy - was " Linus Torvalds
@ 2006-06-13 0:28 ` Mingming Cao
1 sibling, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-13 0:28 UTC (permalink / raw)
To: Neil Brown
Cc: Jeff Garzik, Theodore Tso, Linus Torvalds, Kyle Moffett,
Chase Venters, Alex Tomas, Andreas Dilger, Andrew Morton,
ext2-devel, linux-kernel, linux-fsdevel
On Sun, 2006-06-11 at 14:39 +1000, Neil Brown wrote:
> On Saturday June 10, jeff@garzik.org wrote:
> > Theodore Tso wrote:
> > > So you would be OK with a model where we copy fs/ext3 to
> > > "fs/ext4", and do development there which would be merged rapidly into
> > > mainline so that people who want to participate in testing can use
> > > ext3dev, while people who want stability can use ext3 --- and at some
> > > point, we remove the old ext3 entirely and let fs/ext4 register itself
> > > as both the ext3 and ext4 filesystem, and at some point in the future,
> > > remove the ext3 name entirely?
> >
> > Yep, and in addition I would argue that you can take the opportunity to
> > make ext4 default to extents-enabled, and some similar behavior changes
> > (dir_index default?). The existence of both ext3 and ext4 means you can
> > be more aggressive in turning on stuff, IMO.
> >
> > Jeff
>
> I'm wondering what all this has to say about general principles of
> sub-project development with the Linux kernel.
>
> There is a strong tradition of software projects having a 'stable'
> branch and a 'development' branch, and having both available and both
> receiving bug fixes (at least) so that users can choose what best
> suits their needs.
>
> Due to the (quite appropriate) lack of a stable API for kernel
> modules, it isn't really practical (and definitely isn't encouraged)
> to distribute kernel-modules separately. This seems to suggest that
> if we want a 'stable' and a 'devel' branch of a project, both branches
> need to be distributed as part of the same kernel tree.
>
> Apart from ext2/3 - and maybe reiserfs - there doesn't seem to be much
> evidence of this happening. Why is that?
>
> - is -mm enough? It seems to be enough for small updates, but
> doesn't seem to be enough for more major projects. How long
> have the ext3 patches been in -mm?? (I cannot actually seem
> to find them there at all)
>
To clarify: the first 4 patches of the series are bug fixes that apply to
both 32-bit ext3 (with the current on-disk layout) and 48-bit ext3 (extents
based), and they are in the -mm tree now. Patches 5-13, which add 48-bit
ext3 support based on extents, are not in the -mm tree.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
2006-06-09 19:00 ` Chase Venters
2006-06-09 19:01 ` Jeff Garzik
@ 2006-06-09 19:21 ` Alan Cox
2006-06-09 19:13 ` [Ext2-devel] " Chase Venters
2006-06-09 19:24 ` Alex Tomas
2 siblings, 2 replies; 295+ messages in thread
From: Alan Cox @ 2006-06-09 19:21 UTC (permalink / raw)
To: Chase Venters
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger
Ar Gwe, 2006-06-09 am 13:50 -0500, ysgrifennodd Chase Venters:
> It's about bundling. It's about being able to take your 3-year old
> dependable car and make it faster by bolting on new manifolds and
> turbochargers, rather than waiting a year for the manufacturer to release
> a totally new model
Unfortunately in the software case if you want it in the base kernel you
are bolting new manifolds on everyone's car at once, and someone is going
to have an engine explode as a result.
Ext3 already has enough back compatibility that you can replace the
engine with a horse, we don't need any more in it thank you.
Alan
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:21 ` Alan Cox
@ 2006-06-09 19:13 ` Chase Venters
2006-06-09 19:24 ` Alex Tomas
1 sibling, 0 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 19:13 UTC (permalink / raw)
To: Alan Cox
Cc: Chase Venters, Linus Torvalds, Alex Tomas, Andreas Dilger,
Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel, cmm,
linux-fsdevel
On Fri, 9 Jun 2006, Alan Cox wrote:
> Ar Gwe, 2006-06-09 am 13:50 -0500, ysgrifennodd Chase Venters:
>> It's about bundling. It's about being able to take your 3-year old
>> dependable car and make it faster by bolting on new manifolds and
>> turbochargers, rather than waiting a year for the manufacturer to release
>> a totally new model
>
> Unfortunately in the software case if you want it in the base kernel you
> are bolting new manifolds on everyone's car at once, and someone is going
> to have an engine explode as a result.
Someone _could_ have an engine explode... it's perfectly possible though
that a well-tested 48-bit patch wouldn't cause anyone's ext3 to explode.
(After all, the vehicle analogy breaks down here - software doesn't get
worn out from being run at redline for too many miles.)
> Ext3 already has enough back compatibility that you can replace the
> engine with a horse, we don't need any more in it thank you.
But just what are the costs of calling it quits now? Are we going to deny
users something they need?
>
> Alan
>
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:21 ` Alan Cox
2006-06-09 19:13 ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:24 ` Alex Tomas
2006-06-09 19:25 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 19:24 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
>>>>> Alan Cox (AC) writes:
AC> Unfortunately in the software case if you want it in the base kernel you
AC> are bolting new manifolds on everyone's car at once, and someone is going
AC> to have an engine explode as a result.
please, don't forget you need to enable it by mount option.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:24 ` Alex Tomas
@ 2006-06-09 19:25 ` Jeff Garzik
2006-06-09 19:35 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:25 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger, Alan Cox
Alex Tomas wrote:
>>>>>> Alan Cox (AC) writes:
>
> AC> Unfortunately in the software case if you want it in the base kernel you
> AC> are bolting new manifolds on everyone's car at once, and someone is going
> AC> to have an engine explode as a result.
>
> please, don't forget you need to enable it by mount option.
Irrelevant. That's a development-only situation. It will be enabled by
default eventually, and should be considered in that light.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:25 ` Jeff Garzik
@ 2006-06-09 19:35 ` Alex Tomas
2006-06-09 19:35 ` [Ext2-devel] " Jeff Garzik
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 19:35 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Chase Venters,
Linus Torvalds, cmm, linux-fsdevel, Alex Tomas, Andreas Dilger,
Alan Cox
>>>>> Jeff Garzik (JG) writes:
JG> Irrelevant. That's a development-only situation. It will be enabled
JG> by default eventually, and should be considered in that light.
that's your point of view. mine is that this option (and code)
is to be used only when needed.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:35 ` Alex Tomas
@ 2006-06-09 19:35 ` Jeff Garzik
2006-06-09 20:44 ` Joel Becker
2006-06-11 20:14 ` [Ext2-devel] " grundig
2 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:35 UTC (permalink / raw)
To: Alex Tomas
Cc: Alan Cox, Chase Venters, Linus Torvalds, Andreas Dilger,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Irrelevant. That's a development-only situation. It will be enabled
> JG> by default eventually, and should be considered in that light.
>
> that's your point of view. mine is that this option (and code)
> to be used only when needed.
Regardless of any use "when needed," the code is in the codebase, and is
thus the "if (metadata_v2) ... else ..." maintenance burden that has
been discussed.
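
To make that maintenance cost concrete, here is a minimal sketch, assuming a
simplified inode that maps logical blocks either through a flat per-block
table (the old style) or through extents (the new style). The struct layout
and the names `inode_map`, `extent` and `map_block` are hypothetical and are
not taken from the ext3 sources or from the patches; only the shape of the
"if (metadata_v2) ... else ..." branch is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extent: `len` logical blocks starting at `first_logical`
 * map to consecutive physical blocks starting at `start_phys`. */
struct extent {
    uint32_t first_logical;
    uint32_t len;
    uint64_t start_phys;
};

/* Hypothetical in-memory view of one inode's block map. */
struct inode_map {
    bool uses_extents;            /* the "metadata_v2" switch */
    const uint32_t *direct;       /* old format: one entry per logical block */
    uint32_t n_direct;
    const struct extent *ext;     /* new format: runs of blocks */
    uint32_t n_ext;
};

/* Returns the physical block for `logical`, or 0 for a hole.
 * Every caller-visible behaviour has to be kept identical on both paths. */
static uint64_t map_block(const struct inode_map *m, uint32_t logical)
{
    assert(m);
    if (m->uses_extents) {
        for (uint32_t i = 0; i < m->n_ext; i++) {
            const struct extent *e = &m->ext[i];
            if (logical >= e->first_logical &&
                logical - e->first_logical < e->len)
                return e->start_phys + (logical - e->first_logical);
        }
        return 0;
    }
    return logical < m->n_direct ? m->direct[logical] : 0;
}
```

Even in this toy form, any change to lookup semantics (holes, bounds,
block-number width) has to be made and tested twice, once per branch.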
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:35 ` Alex Tomas
2006-06-09 19:35 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 20:44 ` Joel Becker
2006-06-09 20:49 ` Alex Tomas
2006-06-11 20:14 ` [Ext2-devel] " grundig
2 siblings, 1 reply; 295+ messages in thread
From: Joel Becker @ 2006-06-09 20:44 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
Alan Cox
On Fri, Jun 09, 2006 at 11:35:43PM +0400, Alex Tomas wrote:
> that's your point of view. mine is that this option (and code)
> to be used only when needed.
Which is irrelevant. If you tell the world "extents are
better!", they're going to turn them on regardless of whether you
consider their situation a good candidate. Many non-kernel-hackers
started using reiserfs before it was usably stable, just because
"journaling is better!"
Joel
--
Life's Little Instruction Book #347
"Never waste the opportunity to tell someone you love them."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:44 ` Joel Becker
@ 2006-06-09 20:49 ` Alex Tomas
2006-06-09 21:11 ` Joel Becker
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 20:49 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
Alan Cox
>>>>> Joel Becker (JB) writes:
JB> On Fri, Jun 09, 2006 at 11:35:43PM +0400, Alex Tomas wrote:
>> that's your point of view. mine is that this option (and code)
>> to be used only when needed.
JB> Which is irrelevant. If you tell the world "extents are
JB> better!", they're going to turn them on regardless of whether you
JB> consider their situation a good candidate. Many non-kernel-hackers
JB> started using reiserfs before it was usably stable, just because
JB> "journaling is better!"
I haven't said that so far. I feel absolutely comfortable to put
as many warnings as needed from your point of view.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:49 ` Alex Tomas
@ 2006-06-09 21:11 ` Joel Becker
2006-06-09 21:20 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Joel Becker @ 2006-06-09 21:11 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger,
Alan Cox
On Sat, Jun 10, 2006 at 12:49:54AM +0400, Alex Tomas wrote:
> >>>>> Joel Becker (JB) writes:
> JB> Which is irrelevant. If you tell the world "extents are
> JB> better!", they're going to turn them on regardless of whether you
> JB> consider their situation a good candidate. Many non-kernel-hackers
> JB> started using reiserfs before it was usably stable, just because
> JB> "journaling is better!"
>
> I haven't said that so far. I feel absolutely comfortable to put
> as many warnings as needed from your point of view.
When I say "you", I mean the general consensus. You can scream
"don't do this" as loud as you want, the world might drown you out. Not
every random person that sees "new extents in ext3" is going to know
that Alex is the authority. They certainly aren't going to read the
documentation. They'll read some comment on some website that says "all
you need is '-o extents'!"
Joel
--
"A narcissist is someone better looking than you are."
- Gore Vidal
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:11 ` Joel Becker
@ 2006-06-09 21:20 ` Alex Tomas
2006-06-09 21:29 ` Joel Becker
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 21:20 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox
>>>>> Joel Becker (JB) writes:
JB> When I say "you", I mean the general consensus. You can scream
JB> "don't do this" as loud as you want, the world might drown you out. Not
JB> every random person that sees "new extents in ext3" is going to know
JB> that Alex is the authority. They certainly aren't going to read the
JB> documentation. They'll read some comment on some website that says "all
JB> you need is '-o extents'!"
two points here:
a) warnings should be made visible at mount time,
something like printk(KERN_CRIT ...)
b) I don't think you're going to fight all crazy people in the world,
they'll definitely find a way to break something:
data or something else.
thanks, Alex
PS. in the end, "extents" option affects *new* files only. and one
can boot extents-enabled kernel and convert fs back.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:20 ` Alex Tomas
@ 2006-06-09 21:29 ` Joel Becker
2006-06-09 21:33 ` Alex Tomas
2006-06-09 21:43 ` Joel Becker
0 siblings, 2 replies; 295+ messages in thread
From: Joel Becker @ 2006-06-09 21:29 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox
On Sat, Jun 10, 2006 at 01:20:31AM +0400, Alex Tomas wrote:
> two points here:
> a) warnings should be made visible at mount time,
> something like printk(KERN_CRIT ...)
Too late, they're already broken!
> b) I don't think you're going to fight all crazy people in the world,
> they'll definitely find a way to break something:
> data or something else.
Certainly not the crazy people. But the random person who's
just humming along? We should be nice to them.
> PS. in the end, "extents" option affects *new* files only. and one
> can boot extents-enabled kernel and convert fs back.
I just mentioned to Ted in another mail, since this is a
"permanent" change to the on-disk structure, why is this a mount option?
Shouldn't it rather be a tunefs(8)/mkfs(8) option?
In general, anything you pass to "mount -o" is optional. You
can mount with option X, then unmount and mount without option X. Most
people "expect" this to work (Principle of Least Surprise). So, when
you do:
# mount -o extents /fs1
# create_file /fs1/newfile
# umount /fs1
# mount /fs1
it breaks. Least Surprise expects it to work.
However, tunefs(8) and mkfs(8) are generally understood to make
physical changes. Why not "tunefs -extents" to turn them on? It's
completely analogous to "tunefs -J", will fit everyone's expectation,
and won't surprise people. "mkfs -extents" does the same thing.
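
A minimal sketch of the distinction, assuming a superblock that carries a
persistent "incompatible features" bitmask in the style of ext2/ext3. The
flag name and value `INCOMPAT_EXTENTS`, the structs and the function names
are all hypothetical, chosen for illustration rather than copied from the
kernel or e2fsprogs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical on-disk feature bit; the value is illustrative only. */
#define INCOMPAT_EXTENTS 0x0040u

struct sb {
    uint32_t feature_incompat;    /* persistent: stored on disk */
};

struct mount_opts {
    bool extents;                 /* transient: from "mount -o extents" */
};

/* A kernel that understands only the bits in `supported` must refuse
 * to mount a filesystem that has any other incompat bit set. */
static bool can_mount(const struct sb *sb, uint32_t supported)
{
    assert(sb);
    return (sb->feature_incompat & ~supported) == 0;
}

/* Mount-option style: the first file created while the option is active
 * quietly makes a permanent on-disk change. */
static void create_file(struct sb *sb, const struct mount_opts *o)
{
    if (o->extents)
        sb->feature_incompat |= INCOMPAT_EXTENTS;
}

/* tunefs style: the persistent change is requested explicitly.
 * Clearing the bit is only valid if no extent-mapped files remain;
 * a real tool would have to convert them back first (not shown). */
static void tunefs_set_extents(struct sb *sb, bool on)
{
    if (on)
        sb->feature_incompat |= INCOMPAT_EXTENTS;
    else
        sb->feature_incompat &= ~INCOMPAT_EXTENTS;
}
```

In this model the surprise is that `create_file()` under a supposedly
optional mount flag has the same lasting effect as `tunefs_set_extents()`:
afterwards `can_mount()` fails for any kernel that lacks the feature,
whether or not the option is passed again.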
Joel
--
Life's Little Instruction Book #232
"Keep your promises."
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:29 ` Joel Becker
@ 2006-06-09 21:33 ` Alex Tomas
2006-06-09 21:43 ` Joel Becker
1 sibling, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 21:33 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Chase Venters, cmm, linux-fsdevel, Andreas Dilger, Alan Cox
>>>>> Joel Becker (JB) writes:
JB> On Sat, Jun 10, 2006 at 01:20:31AM +0400, Alex Tomas wrote:
>> two points here:
>> a) warnings should be made visible at mount time,
>> something like printk(KERN_CRIT ...)
JB> Too late, they're already broken!
not at mount time, only upon first file creation.
thanks, Alex
PS. need to think about your tune2fs/mke2fs proposal ... thanks.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:29 ` Joel Becker
2006-06-09 21:33 ` Alex Tomas
@ 2006-06-09 21:43 ` Joel Becker
1 sibling, 0 replies; 295+ messages in thread
From: Joel Becker @ 2006-06-09 21:43 UTC (permalink / raw)
To: Alex Tomas, Jeff Garzik, Alan Cox, Chase Venters, Andreas Dilger,
Andrew Morton, ext2-devel, linux-kernel, cmm, linux-fsdevel
On Fri, Jun 09, 2006 at 02:29:05PM -0700, Joel Becker wrote:
> However, tunefs(8) and mkfs(8) are generally understood to make
> physical changes. Why not "tunefs -extents" to turn them on? It's
> completely analogous to "tunefs -J", will fit everyone's expectation,
> and won't surprise people. "mkfs -extents" does the same thing.
Heck, if you have code to convert extents back to regular ext3,
"tunefs -noextents" works and is properly symmetric.
Joel
--
"The nice thing about egotists is that they don't talk about other
people."
- Lucille S. Harper
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:35 ` Alex Tomas
2006-06-09 19:35 ` [Ext2-devel] " Jeff Garzik
2006-06-09 20:44 ` Joel Becker
@ 2006-06-11 20:14 ` grundig
2006-06-14 16:45 ` Alex Tomas
2 siblings, 1 reply; 295+ messages in thread
From: grundig @ 2006-06-11 20:14 UTC (permalink / raw)
To: Alex Tomas
Cc: jeff, alex, alan, chase.venters, torvalds, adilger, akpm,
ext2-devel, linux-kernel, cmm, linux-fsdevel
El Fri, 09 Jun 2006 23:35:43 +0400,
Alex Tomas <alex@clusterfs.com> escribió:
> >>>>> Jeff Garzik (JG) writes:
>
> JG> Irrelevant. That's a development-only situation. It will be enabled
> JG> by default eventually, and should be considered in that light.
>
> that's your point of view. mine is that this option (and code)
> to be used only when needed.
Distros may ignore your opinion and may enable it, and users won't know
that it's enabled or even that such a feature exists - until they try to run
an older kernel. If almost nobody needs this feature, why not avoid
problems by not merging it and maintaining it separated from the
main tree?
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-11 20:14 ` [Ext2-devel] " grundig
@ 2006-06-14 16:45 ` Alex Tomas
0 siblings, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-14 16:45 UTC (permalink / raw)
To: grundig
Cc: akpm, jeff, ext2-devel, linux-kernel, chase.venters, cmm,
linux-fsdevel, Alex Tomas, adilger, alan
>>>>> grundig (g) writes:
g> Distros may ignore your opinion and may enable it, and users won't know
g> that it's enabled or even that such a feature exists - until they try to run
g> an older kernel. If almost nobody needs this feature, why not avoid
g> problems by not merging it and maintaining it separated from the
g> main tree?
not sure, in such a distro such a user will be aware he's using ext4.
about "nobody needs ...": see my question regarding NUMA in kernel.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:38 ` Linus Torvalds
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
@ 2006-06-09 19:22 ` Alex Tomas
2006-06-09 19:22 ` Jeff Garzik
2006-06-09 20:16 ` Andreas Dilger
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 19:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
IMHO ...
the main reason is that ext4 would be treated as a new generation
fs which will be used for lots of new features probably. and it
will take long to get into production-ready state. at the same
time, proposed patches (at least extents itself) are heavily
tested in production and could be made available for our users
very soon.
thanks, Alex
>>>>> Linus Torvalds (LT) writes:
LT> On Fri, 9 Jun 2006, Alex Tomas wrote:
>>
>> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
LT> Let's put it this way:
LT> - have you had _any_ valid argument at all against "ext4"?
LT> Think about it. Honestly. Tell me anything that doesn't work?
LT> Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:22 ` Alex Tomas
@ 2006-06-09 19:22 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:22 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
> the main reason is that ext4 would be treated as a new generation
> fs which will be used for lots of new features probably. and it
> will take long to get into production-ready state. at the same
> time, proposed patches (at least extents itself) are heavily
> tested in production and could be made available for our users
> very soon.
No -- that's a bad way to develop it, and a good way to ensure it will
never get stable.
You want to start from a known good point, and keep it working.
Standard iterative development model.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:38 ` Linus Torvalds
2006-06-09 18:50 ` [Ext2-devel] " Chase Venters
2006-06-09 19:22 ` Alex Tomas
@ 2006-06-09 20:16 ` Andreas Dilger
2006-06-09 20:31 ` Linus Torvalds
2006-06-09 20:31 ` Jeff Garzik
2 siblings, 2 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 20:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas
On Jun 09, 2006 11:38 -0700, Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Alex Tomas wrote:
> > would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
>
> Let's put it this way:
> - have you had _any_ valid argument at all against "ext4"?
>
> Think about it. Honestly. Tell me anything that doesn't work?
It's funny that everyone is arguing to fork ext3 into ext4, for a feature
that will primarily allow it to work with large disks (that are already
here, not some wacky pipe dream of featuritis as Jeff thinks). Yet the
same people that are advocating code duplication on a massive scale in
ext4 are against 5 lines of duplication between the VFS and a filesystem,
or in a couple of drivers here and there.
Having two copies of ext3 means we immediately get 2x the bugs, and
no guarantee that they will ever be fixed in ext3 (all of the ext3
maintainers will be solidly on the ext4 bandwagon if it comes to that).
It also means that two virtually identical copies of the same code
will be in memory at the same time (one for ext3 and another for ext4)
polluting the cache, even though some developers complain that a single
EXPORT_SYMBOL is "bloating" the kernel. This also means two inode slabs
causing memory fragmentation, etc.
The other issue is that adding a new "ext4" filesystem type will cause
userspace tools to break that assume they know something about the
filesystem type. They will all detect the filesystem as "ext3" and try
to mount it as such, when the required kernel filesystem is ext4. Or
we will need to have "mkfs.ext4", "fsck.ext4", etc, for no particular
reason.
Either a system upgrades totally to ext4 to avoid the duplication of code
in memory (and breaks ALL backward compatibility, for no good reason), or
it lives with "only mount filesystems as 'ext4' when they need it" and the
code will rarely be used.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:16 ` Andreas Dilger
@ 2006-06-09 20:31 ` Linus Torvalds
2006-06-09 20:31 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 20:31 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas
On Fri, 9 Jun 2006, Andreas Dilger wrote:
>
> It's funny that everyone is arguing to fork ext3 into ext4, for a feature
> that will primarily allow it to work with large disks (that are already
> here, not some wacky pipe dream of featuritis as Jeff thinks). Yet the
> same people that are advocating code duplication on a massive scale in
> ext4 are against 5 lines of duplication between the VFS and a filesystem,
> or in a couple of drivers here and there.
You haven't actually listened to any of the arguments, have you?
Please remove me from the Cc on this thread. I'm not interested.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:16 ` Andreas Dilger
2006-06-09 20:31 ` Linus Torvalds
@ 2006-06-09 20:31 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:31 UTC (permalink / raw)
To: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
Andreas Dilger wrote:
> The other issue is that adding a new "ext4" filesystem type will cause
> userspace tools to break that assume they know something about the
> filesystem type. They will all detect the filesystem as "ext3" and try
> to mount it as such, when the required kernel filesystem is ext4. Or
> we will need to have "mkfs.ext4", "fsck.ext4", etc, for no particular
> reason.
Yes, you want those tools, and you want to call the filesystem ext4.
Otherwise you'll never break free of the existing metadata formats
(which are apparently changing over time _anyway_).
> Either a system upgrades totally to ext4 to avoid the duplication of code
> in memory (and breaks ALL backward compatibility, for no good reason), or
Correct. You must upgrade totally to ext4.
And this happens ANYWAY once extents/etc. are enabled. It's an upgrade.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:30 ` Alex Tomas
2006-06-09 18:38 ` Linus Torvalds
@ 2006-06-09 18:43 ` Jeff Garzik
2006-06-09 18:50 ` Diego Calleja
2 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:43 UTC (permalink / raw)
To: Alex Tomas
Cc: Linus Torvalds, Andreas Dilger, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel
Alex Tomas wrote:
>>>>>> Linus Torvalds (LT) writes:
> LT> My point is, maintaining two different pieces is SIMPLER.
>
> "different" is a key word here. why should we copy most of ext3 code
> into ext4?
>
> LT> It would be bigger, if you made ext3 do 48-bit block numbers.
>
> nope, we re-use existing i_data w/o any changes. yes, we've made the
> inode a bit larger to cache the last found extent. this noticeably
> improves performance in some workloads, though.
>
> LT> See? ext3 would become strictly _worse_ for the majority of users, who
> LT> wouldn't get any advantage. That's my point.
>
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
No, that would be worse.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:30 ` Alex Tomas
2006-06-09 18:38 ` Linus Torvalds
2006-06-09 18:43 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:50 ` Diego Calleja
2006-06-09 19:08 ` Diego Calleja
2 siblings, 1 reply; 295+ messages in thread
From: Diego Calleja @ 2006-06-09 18:50 UTC (permalink / raw)
To: Alex Tomas
Cc: torvalds, adilger, alex, jeff, akpm, ext2-devel, linux-kernel,
cmm, linux-fsdevel
El Fri, 09 Jun 2006 22:30:20 +0400,
Alex Tomas <alex@clusterfs.com> escribió:
> LT> See? ext3 would become strictly _worse_ for the majority of users, who
> LT> wouldn't get any advantage. That's my point.
>
> would "#if CONFIG_EXT3_EXTENTS" be a good solution then?
Not at all, a config option may be disabled by lots of distros
and make backwards compatibility even more difficult than
it is already going to be.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:50 ` Diego Calleja
@ 2006-06-09 19:08 ` Diego Calleja
0 siblings, 0 replies; 295+ messages in thread
From: Diego Calleja @ 2006-06-09 19:08 UTC (permalink / raw)
To: Diego Calleja
Cc: akpm, jeff, ext2-devel, linux-kernel, torvalds, cmm,
linux-fsdevel, alex, adilger
El Fri, 9 Jun 2006 20:50:00 +0200,
Diego Calleja <diegocg@gmail.com> escribió:
> Not at all, a config option may be disabled by lots of distros
> and make backwards compatibility even more difficult than
> it is already going to be.
(I meant: distros could switch it off, and two years from now
you might, for some reason, try to read data from a disk
created by a kernel with that feature and it won't work;
with the current approach you'll be able to use a mount flag.)
In my very humble user opinion, the big difference between ext2/3
and ext3/4 is that ext2/3 really was supposed to be on-disk
compatible, except for the journal. However ext4, AIUI, is
supposed to be _really_ different.
The kernel that includes the 48bit patches will already be a sort
of "ext4" filesystem, since 2.6.17 and previous kernels are not
going to be able to read it. Moving the source to ext4 or keeping
it in ext3/ is just about maintenance, not about making a new
filesystem or not - that will happen as soon as the patches are
merged.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:10 ` [Ext2-devel] " Andreas Dilger
2006-06-09 18:22 ` Linus Torvalds
@ 2006-06-09 18:40 ` Jeff Garzik
2006-06-09 18:59 ` Andrew Morton
2006-06-09 18:41 ` Jeff Garzik
2 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:40 UTC (permalink / raw)
To: Andreas Dilger
Cc: Linus Torvalds, Alex Tomas, Andrew Morton, ext2-devel,
linux-kernel, cmm, linux-fsdevel
Andreas Dilger wrote:
> Having a single codebase for everyone means that it is continually maintained
> and users of ext3 aren't left out in the cold.
That implies continually upgrading ext3 for newer storage technologies,
which in turn implies adding all sorts of incompatible formats to
support better storage scaling, and new usage models.
This constant patching of ext3 is IMO one of the problems. Let it
stabilize with current storage technologies.
> On Jun 09, 2006 09:54 -0700, Linus Torvalds wrote:
>> Btw, I'm not kidding you on this one.
>>
>> THE NUMBER ONE MEMORY USAGE ON A LOT OF LOADS IS EXT3 INODES IN MEMORY!
>
> Do you think that would be any different with a new filesystem?
>
>> And you know what? 2TB files are totally uninteresting to 99.9999% of all
>> people. Most people find it _much_ more interesting to have hundreds of
>> thousands of _smaller_ files instead.
>>
>> So do this:
>>
>> cat /proc/slabinfo | grep ext3
>
> # head -2 /proc/slabinfo
> slabinfo - version: 2.1
> name <active_objs> <num_objs> <objsize> <objperslab>
>
> # grep ext2 /proc/slabinfo
> ext2_inode_cache 0 0 572 7
> ext2_xattr 0 0 48 81
>
> # grep ext3 /proc/slabinfo
>
> ext3_inode_cache 30207 41418 616 6
> ext3_xattr 0 0 48 81
>
> # grep xfs /proc/slabinfo
> xfs_ili 2558 2576 140 28
> xfs_inode 2558 2565 448 9
>
> # grep jfs /proc/slabinfo
> jfs_ip 0 0 1048 3
>
> So, the ext3 inode could grow another ~50 bytes without changing the
> slab allocation size ;-), and in fact other filesystems aren't noticeably
> different.
>
>> and be absolutely disgusted and horrified by the size of those inodes
>> already, and ask yourself whether extending the block size to 48 bits will
>> help or further hurt one of the biggest problems of ext3 right now?
>
> This is then the biggest problem of all filesystems.
>
>> (And yes, I realize that block numbers are just a small part of it. The
>> "vfs_inode" is also a real problem - it's got _way_ too many large
>> list-heads that explode on a 64-bit kernel, for example. Oh, well.
>
> On a 32-bit system the vfs_inode is more than half of the size of the ext3
> inode, it is worse on 64-bit systems.
>
>> My point is that things like this can make a very real issue _worse_ for all
>> the people who don't care one whit about it)
>
> The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
> and I think I argued fairly strongly to also have a CONFIG_ flag to allow
> larger than 2TB file support only for those users that want it.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
>
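A rough check of the "~50 bytes" headroom claim in the slabinfo numbers quoted above. This is only a sketch: it assumes each slab is a single 4096-byte page and ignores any per-slab bookkeeping, neither of which is stated in the thread, so it gives an upper bound rather than the exact figure.

```python
PAGE_SIZE = 4096  # assumed slab size in bytes; not stated in the thread


def growth_headroom(objsize, objs_per_slab, slab_bytes=PAGE_SIZE):
    """Upper bound on how many bytes an object can grow before
    fewer objects fit in one slab (ignores slab management overhead)."""
    return slab_bytes // objs_per_slab - objsize


# (objsize, objperslab) pairs copied from the quoted /proc/slabinfo output
caches = {
    "ext2_inode_cache": (572, 7),
    "ext3_inode_cache": (616, 6),
    "xfs_inode": (448, 9),
    "jfs_ip": (1048, 3),
}

for name, (objsize, per_slab) in caches.items():
    print(name, growth_headroom(objsize, per_slab))
```

Under those assumptions the ext3 inode has at most 66 bytes of room before dropping to 5 objects per slab, which is in the same ballpark as the quoted estimate.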
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:40 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:59 ` Andrew Morton
2006-06-09 19:16 ` Jeff Garzik
2006-06-09 20:44 ` Alan Cox
0 siblings, 2 replies; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 18:59 UTC (permalink / raw)
To: Jeff Garzik
Cc: ext2-devel, linux-kernel, torvalds, cmm, linux-fsdevel, alex,
adilger
On Fri, 09 Jun 2006 14:40:56 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> Andreas Dilger wrote:
> > Having a single codebase for everyone means that it is continually maintained
> > and users of ext3 aren't left out in the cold.
>
> That implies continually upgrading ext3 for newer storage technologies,
> which in turn implies adding all sorts of incompatible formats to
> support better storage scaling, and new usage models.
Look, I'm not certain either way on this - I really don't like the format
incompatibility and I'd like to see a breakdown of the performance benefits
of each of the proposed new features so perhaps we can cherrypick. And I'm
deferring judgement until I've looked at some patches.
But Jeff, please stop this wild exaggeration! "continually upgrading",
"all sorts of incompatible formats". It's not helping anything.
Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
ago, and probably with RH's 2.2-based implementation. So we have not done
and will not do the things which you are FUDding us about.
This is (again, as far as I recall) the first on-disk-incompatible change
in ext3 which has ever been proposed. It's not a thing which is done
lightly and it's not a thing which is likely to happen again for a very long
time indeed.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:59 ` Andrew Morton
@ 2006-06-09 19:16 ` Jeff Garzik
2006-06-09 20:27 ` [Ext2-devel] " Chase Venters
2006-06-09 20:44 ` Alan Cox
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:16 UTC (permalink / raw)
To: Andrew Morton
Cc: ext2-devel, linux-kernel, torvalds, cmm, linux-fsdevel, alex,
adilger
Andrew Morton wrote:
> On Fri, 09 Jun 2006 14:40:56 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> Andreas Dilger wrote:
>>> Having a single codebase for everyone means that it is continually maintained
>>> and users of ext3 aren't left out in the cold.
>> That implies continually upgrading ext3 for newer storage technologies,
>> which in turn implies adding all sorts of incompatible formats to
>> support better storage scaling, and new usage models.
>
> Look, I'm not certain either way on this - I really don't like the format
> incompatibility and I'd like to see a breakdown of the performance benefits
> of each of the proposed new features so perhaps we can cherrypick. And I'm
> deferring judgement until I've looked at some patches.
>
> But Jeff, please stop this wild exaggeration! "continually upgrading",
> "all sorts of incompatible formats". It's not helping anything.
>
> Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
> ago, and probably with RH's 2.2-based implementation. So we have not done
> and will not do the things which you are FUDding us about.
>
> This is (again, as far as I recall) the first on-disk-incompatible change
> in ext3 which has ever been proposed. It's not a thing which is done
> lightly and it's not a thing which is likely to happen again for a very long
> time indeed.
That's not really true, I include in the list EXT3_FEATURE_RO_COMPAT_*,
EXT3_FEATURE_INCOMPAT_*, 32-bit uid/gid, ISTR some ACL-related mess, and
the online resizing stuff that produces a filesystem slightly different
than what mke2fs would produce for the same [larger] sized block device.
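For anyone following along who hasn't looked at how those feature flags gate a mount, here is a minimal sketch of the usual rule: an unknown INCOMPAT bit means the mount is refused, an unknown RO_COMPAT bit means read-only at best, and unknown COMPAT bits are ignored. The bit values and the function name are made up for illustration; they are not the real EXT3_FEATURE_* constants.

```python
# Hypothetical bit values, for illustration only (not the real constants).
FEAT_A = 0x1
FEAT_B = 0x2


def mount_decision(fs_incompat, fs_ro_compat, known_incompat, known_ro_compat):
    """Decide how a kernel that understands the 'known' feature bits
    may mount a filesystem carrying the 'fs' feature bits."""
    if fs_incompat & ~known_incompat:
        return "refuse"          # an incompatible feature we don't understand
    if fs_ro_compat & ~known_ro_compat:
        return "read-only"       # safe to read, not safe to modify
    return "read-write"


# An old kernel (knows only FEAT_A) meeting a filesystem that uses FEAT_B:
print(mount_decision(FEAT_B, 0, FEAT_A, 0))
```

Every combination of bits is a distinct on-disk variant the kernel has to handle, which is the combinatorial concern being argued about here.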
Red Hat has had at least one problem in the past where users were
annoyed at format changes (htree?).
I certainly grant that extents and 48bit are format changes on a -much-
larger scale than in the past. Absolutely.
That's why I feel that this is a good point to calm down ext3
development, and start putting stuff like extents into ext4. If we are
starting to make major changes to the format, that should be a signal
that we are starting to work on a new filesystem, rather than patching
an old one.
I disagree with the "years to stabilize ext4" argument, because we are
starting from a known good point. I think ext4 will be easier to
maintain and tune for modern storage systems, if we don't have to worry
as much about that stuff for ext3.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:16 ` Jeff Garzik
@ 2006-06-09 20:27 ` Chase Venters
0 siblings, 0 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 20:27 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, adilger, torvalds, alex, ext2-devel, linux-kernel,
cmm, linux-fsdevel
On Fri, 9 Jun 2006, Jeff Garzik wrote:
> I disagree with the "years to stabilize ext4" argument, because we are
> starting from a known good point. I think ext4 will be easier to maintain
> and tune for modern storage systems, if we don't have to worry as much about
> that stuff for ext3.
Let's say we
# cp ext3 ext4
# cat extents 48bit | patch
and then roll it out in 2.6.18. That in and of itself is probably fine and
stable (though it's no different than ext3 except for the name and the two
new additions).
But are you going to do this again for ext5 when more features come along?
Or are you going to warn ext4 users that the FS is not expected to be stable?
If you do the latter, be prepared for people to be wary of using it for a
long while. The difference is between actual and perceived stability.
To put a finer point on it - I've got a system that's been running
flawlessly for years on 2.5.3. It's actually been stable - never had any
sort of crashing problem at all. But I'm essentially crazy for running
that kernel. At the time I installed it, it certainly wasn't perceived as
stable. If the computer in question were any more than a file server /
iptables box for my home, I'd have said "well, hell, I think I'm going to
have to do without 2.5 so that I can have something trustworthy."
(Amusingly enough, I started assembling a replacement for it recently,
if only to have something newer and more capable. Having gone from
Slackware to Gentoo I decided to give the April stable
Debian release a whirl. Imagine my shock and awe when I watched Debian
boot into a 2.4 kernel :P)
> Jeff
>
Cheers,
Chase
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:59 ` Andrew Morton
2006-06-09 19:16 ` Jeff Garzik
@ 2006-06-09 20:44 ` Alan Cox
2006-06-11 15:52 ` [Ext2-devel] " Arjan van de Ven
1 sibling, 1 reply; 295+ messages in thread
From: Alan Cox @ 2006-06-09 20:44 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, ext2-devel, linux-kernel, torvalds, cmm,
linux-fsdevel, alex, adilger
Ar Gwe, 2006-06-09 am 11:59 -0700, ysgrifennodd Andrew Morton:
> Today's ext3 is, afaik, 100% on-disk compatible with ext3 from five years
> ago, and probably with RH's 2.2-based implementation.
Maybe, if your files are under 2GB long and you've not used any attributes,
SELinux labels or various other things. In the practical real-world
case it isn't. I doubt many Fedora/Red Hat users have a single FS
from RHEL4/FC1 onwards that is readable by 2.2 ext3 (or most 2.4 ext3).
OTOH the number of complaints about this is minimal, people want to go
forwards in a controlled manner not backwards.
Alan
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:44 ` Alan Cox
@ 2006-06-11 15:52 ` Arjan van de Ven
0 siblings, 0 replies; 295+ messages in thread
From: Arjan van de Ven @ 2006-06-11 15:52 UTC (permalink / raw)
To: Alan Cox
Cc: Andrew Morton, Jeff Garzik, adilger, torvalds, alex, ext2-devel,
linux-kernel, cmm, linux-fsdevel
On Fri, 2006-06-09 at 21:44 +0100, Alan Cox wrote:
> OTOH the number of complaints about this is minimal, people want to go
> forwards in a controlled manner not backwards.
well... they want to be able to go "a little bit" backwards; say one
version of an OS (6 months). Eg the scenario that ought to work is "go
to newer version, hate it, go back". But yes that's a limited time to go
back, not the "go back to 2.2" kind of "go back".
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:10 ` [Ext2-devel] " Andreas Dilger
2006-06-09 18:22 ` Linus Torvalds
2006-06-09 18:40 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 18:41 ` Jeff Garzik
2 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:41 UTC (permalink / raw)
To: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel
Andreas Dilger wrote:
> The current group of changes will be a no-op if CONFIG_LBD isn't enabled,
> and I think I argued fairly strongly to also have a CONFIG_ flag to allow
> larger than 2TB file support only for those users that want it.
Please be realistic.
Distros will all want to turn this on, from now until eternity.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:25 ` Linus Torvalds
2006-06-09 16:48 ` Alex Tomas
2006-06-09 16:54 ` [Ext2-devel] " Linus Torvalds
@ 2006-06-09 17:12 ` Jeff Anderson-Lee
2006-06-09 18:02 ` Andrew Morton
3 siblings, 0 replies; 295+ messages in thread
From: Jeff Anderson-Lee @ 2006-06-09 17:12 UTC (permalink / raw)
To: linux-kernel; +Cc: 'ext2-devel', linux-fsdevel
Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Alex Tomas wrote:
>
> > I believe it's as stable as before until you mount with extents
> > mount option.
>
> In contrast, the last time two different filesystems introduced bugs in
> each other was approximately "never". They simply don't modify each others
> code, they don't look at each others data structures, and they don't jump
> into each others routines.
As an interested bystander (and large filesystem user), I'd say I tend to
agree with Linus and Jeff on this one.
* ext3 is arguably the main Linux filesystem: too important to keep
"experimenting" with.
* I'd encourage a >2TB version, but call it ext4. It makes it clear
that you are entering new territory.
* Take advantage of the switch to remove some of the backward compatibility
cruft from the ext4 version -- make it a clean, explicit break.
* [Possibly even inoculate ext3 against creeping featuritis and work on
cleanup and optimization instead.]
This is not intended to slight the work/position of the ext3 developers,
merely to inform them of an end-user's perspective.
----
Jeff Anderson-Lee
Petabyte Storage Infrastructure Project
University of California Berkeley
"Simplify, simplify, simplify." -- Henry David Thoreau
"I think one 'simplify' would have sufficed." -- Ralph Waldo Emerson
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:25 ` Linus Torvalds
` (2 preceding siblings ...)
2006-06-09 17:12 ` Jeff Anderson-Lee
@ 2006-06-09 18:02 ` Andrew Morton
3 siblings, 0 replies; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 18:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: alex, jeff, ext2-devel, linux-kernel, cmm, linux-fsdevel, adilger
On Fri, 9 Jun 2006 09:25:57 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:
> (buffer heads! In 2006!)
We should be able to make the vast majority of those go away, btw.
We already have `-o data=writeback,nobh'. That gives us writeback-mode
with no buffer_heads on the pagecache.
On top of that we can implement nobh ordered-mode by adding an inode walk
which calls do_sync_file_range() into the appropriate place in commit.
The tricky part is the inode walk - at present super_block.s_list is a
list_head and it's not trivial to walk that without missing some inodes.
Probably it could be done via a new fs-private dirty-inode list which we
handle carefully, or via a walk of an i_ino-ordered radix-tree, which
doesn't miss things.
I floated this a year or so ago, but no little fishies bit.
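To illustrate why an i_ino-ordered walk "doesn't miss things", here is a toy userspace sketch. It is not kernel code: a sorted list stands in for the radix tree, and the class and function names are invented for the example. The point is that each step re-looks-up the next inode number after a cursor, so inodes added or removed during the walk cannot make it skip an entry that is still present.

```python
import bisect


class InodeIndex:
    """Toy stand-in for an i_ino-ordered index (a radix tree in the kernel)."""

    def __init__(self):
        self._inos = []  # sorted inode numbers

    def add(self, ino):
        pos = bisect.bisect_left(self._inos, ino)
        if pos == len(self._inos) or self._inos[pos] != ino:
            self._inos.insert(pos, ino)

    def remove(self, ino):
        pos = bisect.bisect_left(self._inos, ino)
        if pos < len(self._inos) and self._inos[pos] == ino:
            self._inos.pop(pos)

    def next_after(self, cursor):
        """Smallest inode number strictly greater than cursor, or None."""
        pos = bisect.bisect_right(self._inos, cursor)
        return self._inos[pos] if pos < len(self._inos) else None


def walk(index, visit):
    """Visit inodes in i_ino order, looking up from a cursor at each step."""
    cursor = -1
    visited = []
    while True:
        ino = index.next_after(cursor)
        if ino is None:
            return visited
        visit(ino)  # may add or remove other inodes
        visited.append(ino)
        cursor = ino
```

A plain list walk holding a pointer to the "next" entry breaks if that entry is removed underneath it; the cursor lookup above only depends on the key order.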
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:25 ` Jeff Garzik
2006-06-09 15:40 ` Linus Torvalds
@ 2006-06-10 19:10 ` Kyle Moffett
2006-06-10 19:27 ` Linus Torvalds
1 sibling, 1 reply; 295+ messages in thread
From: Kyle Moffett @ 2006-06-10 19:10 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
On Jun 9, 2006, at 11:25:31, Jeff Garzik wrote:
> Overall, I'm surprised that ext3 developers don't see any of the
> problems related to progressive, stealth filesystem upgrades.
>
> Users are never given a clear indication of when their metadata is
> being upgraded, there is no clear "line of demarcation" they cross,
> when they start using extents.
>
> Since there is no user-visible fs upgrade event, users do not have
> a clear picture of what features are being used -- which means they
> are kept in the dark about which kernels are OK to use on their data.
>
> Do you guys honestly expect users to keep track of which kernels
> added specific ext3 features?
>
> This is why other enterprise filesystems have clear "fs version 1",
> "fs version 2" points across which a user migrates. ext3's feature-
> flags approach just means that there are a million combinations of
> potential old-and-new features, in-tree and third party, all of
> which must be supported.
One possible solution to the version-confusion that would avoid
duplicating features would be to merge the fs/ext{2,3} to fs/ext,
then make fs/ext register itself as a filesystem under "ext2",
"ext3", and "ext4". Then have each name imply a specific set of
features and compatibility. That would allow the same performance
optimizations to affect all 3 even as you make metadata changes in
the latest version. I've heard quite some griping about the amount
of duplicated code between ext2 and ext3; why cause those problems
again with an "ext4"? There would probably be some
fs/ext/ext{2,3,4}_foo.c files that could be compiled in or out depending on configured
FS support, but I would guess that would make it easier on users and
developers alike.
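A toy model of what "each name implies a specific set of features" could look like, purely as a sketch. The feature names and the table are hypothetical and do not reflect the real compatibility rules between ext2, ext3 and the proposed ext4.

```python
# Hypothetical feature names, chosen only for illustration.
ALLOWED_FEATURES = {
    "ext2": frozenset(),
    "ext3": frozenset({"journal"}),
    "ext4": frozenset({"journal", "extents", "48bit"}),
}


def can_mount(fs_name, on_disk_features):
    """True if every on-disk feature is permitted under this fs name."""
    return set(on_disk_features) <= ALLOWED_FEATURES[fs_name]


def oldest_name_for(on_disk_features):
    """Oldest registered name whose feature set covers the disk, or None."""
    for name in ("ext2", "ext3", "ext4"):
        if can_mount(name, on_disk_features):
            return name
    return None


print(oldest_name_for({"journal", "extents"}))
```

One shared code base would consult a table like this at mount time, so the name the admin sees tells them which kernels can read the disk.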
Cheers,
Kyle Moffett
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 19:10 ` Kyle Moffett
@ 2006-06-10 19:27 ` Linus Torvalds
0 siblings, 0 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-10 19:27 UTC (permalink / raw)
To: Kyle Moffett
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Andreas Dilger
On Sat, 10 Jun 2006, Kyle Moffett wrote:
>
> One possible solution to the version-confusion that would avoid duplicating
> features would be to merge the fs/ext{2,3} to fs/ext, then make fs/ext
> register itself as a filesystem under "ext2", "ext3", and "ext4".
But the thing is, technical people don't actually care about the version
confusion.
The real issue is that ext3 is a stable filesystem, and the ext4 stuff
buys fundamentally and absolutely _nothing_ for the vast majority of uses.
Except pain.
So the real reason for the split would be the _user_ split. There are
people who want big filesystems, and there are people who don't care.
It's that simple.
> I've heard quite some griping about the amount of duplicated code
> between ext2 and ext3;
That's a total piece of bullshit. Nobody seriously gripes about the
duplication, and the ones that do have absolutely no idea what that split
bought us. Ignore them.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:08 ` Jeff Garzik
2006-06-09 15:25 ` Jeff Garzik
@ 2006-06-09 15:28 ` Alex Tomas
2006-06-09 15:31 ` Matthew Wilcox
` (2 more replies)
2006-06-09 20:32 ` Stephen C. Tweedie
2 siblings, 3 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 15:28 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel
JG> "ext3" will become more and more meaningless. It could mean _any_ of
JG> several filesystem metadata variants, and the admin will have no clue
JG> which variant they are talking to until they try to mount the blkdev
JG> (and possibly fail the mount).
debugfs <dev> -R stats | grep features ?
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:28 ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 15:31 ` Matthew Wilcox
2006-06-10 3:26 ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
2006-06-09 15:44 ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
2006-06-09 15:53 ` [Ext2-devel] " Gerrit Huizenga
2 siblings, 1 reply; 295+ messages in thread
From: Matthew Wilcox @ 2006-06-09 15:31 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 07:28:22PM +0400, Alex Tomas wrote:
> JG> "ext3" will become more and more meaningless. It could mean _any_ of
> JG> several filesystem metadata variants, and the admin will have no clue
> JG> which variant they are talking to until they try to mount the blkdev
> JG> (and possibly fail the mount).
>
> debugfs <dev> -R stats | grep features ?
... a simple and intuitive command which just trips off the tongue.
I want extents, but I'm still unconvinced that ext3 needs to grow beyond
32-bit blocks. The scheme posted by Val and Arjan (with the
continuation inodes) seems much neater.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
2006-06-09 15:31 ` Matthew Wilcox
@ 2006-06-10 3:26 ` Valerie Henson
2006-06-10 5:25 ` Andreas Dilger
2006-06-10 14:22 ` Jeff Garzik
0 siblings, 2 replies; 295+ messages in thread
From: Valerie Henson @ 2006-06-10 3:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Jeff Garzik, Arjan van de Ven, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
On Fri, Jun 09, 2006 at 09:31:16AM -0600, Matthew Wilcox wrote:
>
> I want extents, but I'm still unconvinced that ext3 needs to grow beyond
> 32-bit blocks. The scheme posted by Val and Arjan (with the
> continuation inodes) seems much neater.
Well, thanks! Arjan and I like our idea too, but at this point it's
just an idea. We'll be hashing it out some more at the file system
workshop next week.
To be honest, continuation inodes and these ext3 patches are
addressing different problems. ext3 48-bit extents are an advanced
solution to a complex problem - growing ext3 beyond 8TB while keeping
as much of the existing on-disk format and associated stable code as
possible. It's hard work and the ext3 developers came up with some
good ideas. Continuation inodes are an idea about how to limit error
propagation in large file systems - an idea which happens to allow
file systems larger than 8 TB with 32-bit block pointers.
So what the heck are continuation inodes? Actually, we named this
"chunkfs" - not particularly descriptive, maybe continuation inodes is
a better term.
Continuation inodes/chunkfs are an idea Arjan and I came up with,
inspired loosely by the ext2 dirty bit code. The problem we were
trying to solve is how to isolate the effects of file system
corruption (from crash, bug, or I/O error) so that we didn't have to
run fsck over the entire file system in order to repair it. This is
important because disk bandwidth is not growing as fast as disk
capacity, so the absolute time to read the entire disk is growing.
The basic idea is to create a bunch of small file systems - chunks -
which look like one big file system to the administrator. Major
problems to solve:
1. Files which span more than one chunk (file system).
2. Hard links from a directory in chunk A to a file in chunk B.
The solution we came up with is to create a "continuation inode" in
every file system chunk which contains data for a particular file or
directory. For example, if file "foo" has its inode in chunk A, and
some file data in chunk B, we would create a continuation inode in
chunk B. The continuation inode has a back pointer to the parent
inode. Now imagine there is some kind of corruption in chunk B and we
need to check the file system. We can determine the free or allocated
state of every block in chunk B without reading any metadata outside
of chunk B.
Similarly, if we create a hard link to file "foo" in chunk A from
directory "bar" in chunk B, we will allocate a continuation inode for
directory "bar" in chunk B, and then allocate a block to contain the
link to "foo" in chunk B. Once again, to find the link count of every
inode in chunk B, we only have to look at directories inside of chunk
B. There are still problems that require checking across chunks, but
we only need to read inodes and directory entries in those cases and
the checks are much simpler than in existing fsck.
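A very small model of the file case described above, to make the back pointer and the "check chunk B using only chunk B" property concrete. This is just a sketch of the idea as stated in this mail; the class names, fields and numbers are invented for illustration and say nothing about any eventual on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    ino: int
    blocks: list          # block numbers local to the owning chunk
    parent: tuple = None  # (chunk_id, ino) back pointer; None for a primary inode


@dataclass
class Chunk:
    chunk_id: str
    nblocks: int
    inodes: dict = field(default_factory=dict)

    def allocated_blocks(self):
        """Blocks in use, derived only from inodes stored in this chunk."""
        used = set()
        for inode in self.inodes.values():
            used.update(inode.blocks)
        return used

    def free_blocks(self):
        return set(range(self.nblocks)) - self.allocated_blocks()


# File "foo": primary inode in chunk A, some data in chunk B.
a = Chunk("A", nblocks=8)
b = Chunk("B", nblocks=8)
a.inodes[12] = Inode(ino=12, blocks=[0, 1])
b.inodes[3] = Inode(ino=3, blocks=[4, 5], parent=("A", 12))  # continuation inode

print(sorted(b.free_blocks()))
```

Note that `b.free_blocks()` never touches chunk A: the continuation inode carries everything needed to account for foo's blocks in chunk B.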
One interesting possibility would be to combine this with the ext2
dirty bit patches. They create a clean/dirty bit for an ext2 file
system. If the system crashes while the file system is being written
to, the bit is set to dirty and we do a full fsck. If the system
crashes while it's inactive, the bit is clean, and all we have to do
is a little bit of orphan inode cleanup before mounting. If we
implement chunkfs on top of this, we could get away with fsck'ing only
a few of the file systems each time, getting ext2-style performance
with ext3-style fast recovery.
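As a back-of-the-envelope sketch of why checking only dirty chunks is attractive: the chunk sizes, the scan rate and the function name below are all made-up assumptions, and the model pretends fsck time is simply proportional to the amount of disk scanned.

```python
def fsck_plan(chunks, scan_rate_mb_s):
    """chunks maps name -> (size_mb, dirty).
    Returns (dirty chunk names, seconds to scan them, seconds for a full scan)."""
    dirty = sorted(name for name, (_, is_dirty) in chunks.items() if is_dirty)
    dirty_mb = sum(chunks[name][0] for name in dirty)
    total_mb = sum(size for size, _ in chunks.values())
    return dirty, dirty_mb / scan_rate_mb_s, total_mb / scan_rate_mb_s


# Four equal chunks, one of which was being written at crash time.
chunks = {
    "A": (1000, False),
    "B": (1000, True),
    "C": (1000, False),
    "D": (1000, False),
}
print(fsck_plan(chunks, scan_rate_mb_s=50))
```

With one dirty chunk out of four, the repair pass reads a quarter of the disk instead of all of it.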
I measured the number of different block groups that were
simultaneously dirty on my laptop's file system as a proxy for how
many chunks would be dirty; it turns out that on average most block
groups were clean 98% of the time, and when I really pushed my
(admittedly dinky) disk I/O system with an artificial load, only a
maximum of 25% of the block groups were dirty during any one second
period. So it's tempting... We'll talk about it more next week, I
hope.
-VAL
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
2006-06-10 3:26 ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
@ 2006-06-10 5:25 ` Andreas Dilger
2006-06-10 5:41 ` Valerie Henson
2006-06-10 14:22 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 5:25 UTC (permalink / raw)
To: Valerie Henson
Cc: Andrew Morton, Jeff Garzik, Matthew Wilcox, Arjan van de Ven,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
Alex Tomas
On Jun 09, 2006 20:26 -0700, Valerie Henson wrote:
> To be honest, continuation inodes and these ext3 patches are
> addressing different problems. ext3 48-bit extents are an advanced
> solution to a complex problem - growing ext3 beyond 8TB while keeping
> as much of the existing on-disk format and associated stable code as
> possible.
The 48-bit support was actually only a small part of the original reason
for extents, though it seems to be the most popular one right now. The other
issues that are being addressed are:
- performance issues like avoiding 0.1%+ indirect block metadata overhead
for each file (which is bad for the cache, and also hurts unlinks)
- the extent index blocks are also more robust than indirect blocks (they
have a magic and internally verifiable structure, and the possibility
to easily add metadata checksums and extent->inode backpointers to
allow improved filesystem checking). With large ext3 filesystems the
{d,t,}indirect blocks can have random garbage in them and there is no
way for the kernel to know unless it overlaps with other fixed metadata.
- the ability to do things like preallocation of files efficiently (via
uninitialized extents), instead of zero-filling the whole file.
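As a rough check on the "0.1%+" figure in the first item, the overhead of
the indirect-block scheme can be worked out directly. This is only a
sketch: the 4 KiB block size, 4-byte block pointers and 12 direct pointers
per inode are assumptions about a typical ext3 setup, not numbers taken
from this thread, and the function name is made up for illustration.

```python
# Assumed parameters (not stated in the thread): 4 KiB blocks, 4-byte block
# pointers, and 12 direct block pointers stored in the inode itself.
BLOCK_SIZE = 4096
PTR_SIZE = 4
DIRECT_PTRS = 12
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE  # 1024 pointers per indirect block


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def indirect_blocks_needed(data_blocks: int) -> int:
    """Count the {t,d,}indirect blocks needed to map `data_blocks` blocks."""
    n = data_blocks - DIRECT_PTRS
    if n <= 0:
        return 0
    total = 1                      # the single indirect block
    n -= PTRS_PER_BLOCK
    if n <= 0:
        return total
    chunk = min(n, PTRS_PER_BLOCK ** 2)
    total += 1 + ceil_div(chunk, PTRS_PER_BLOCK)   # dindirect + its leaves
    n -= chunk
    if n <= 0:
        return total
    leaves = ceil_div(n, PTRS_PER_BLOCK)
    total += 1 + ceil_div(leaves, PTRS_PER_BLOCK) + leaves  # tindirect tree
    return total


one_gib_blocks = (1 << 30) // BLOCK_SIZE           # 262144 data blocks
overhead = indirect_blocks_needed(one_gib_blocks) / one_gib_blocks
print(f"{overhead:.4%}")
```

Under these assumptions each leaf indirect block maps 1024 data blocks, so
the overhead settles at roughly one metadata block per thousand data
blocks, which is in line with the figure above. A contiguous file mapped by
a handful of extents needs none of these blocks.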
> Continuation inodes/chunkfs are an idea Arjan and I came up with,
> inspired loosely by the ext2 dirty bit code. The problem we were
> trying to solve is how to isolate the effects of file system
> corruption (from crash, bug, or I/O error) so that we didn't have to
> run fsck over the entire file system in order to repair it.
I think this is a great idea, and one that is very similar to what
we are doing with ext3 filesystems in Lustre. There is definitely
a desire to harden the ext3 code in many ways against such failures,
and being able to check independent parts of the filesystem is a
very desirable part of this.
> The solution we came up with is to create a "continuation inode" in
> every file system chunk which contains data for a particular file or
> directory. For example, if file "foo" has its inode in chunk A, and
> some file data in chunk B, we would create a continuation inode in
> chunk B. The continuation inode has a back pointer to the parent
> inode.
This needs some extra data in the directory entry, which I've already
been thinking about for ext3, so if you are looking at implementing
this for ext3 I'd be happy to share some ideas.
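For anyone following along, the layout in the quoted paragraph can be
sketched as a small data structure. This is illustrative only: the class
and field names are hypothetical, and the forward map from the primary
inode to its continuations is an assumption added for the example (the
description above only mentions the back pointer).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Inode:
    chunk: str                                   # chunk that holds this inode
    ino: int
    blocks: List[int] = field(default_factory=list)
    # Back pointer (chunk, ino) to the parent inode; None for the primary.
    parent: Optional[Tuple[str, int]] = None
    # Hypothetical forward map: chunk name -> continuation inode.
    continuations: Dict[str, "Inode"] = field(default_factory=dict)


def add_block(primary: Inode, chunk: str, block: int, next_ino: int) -> Inode:
    """Record a data block; create a continuation inode if it lives elsewhere."""
    if chunk == primary.chunk:
        primary.blocks.append(block)
        return primary
    cont = primary.continuations.get(chunk)
    if cont is None:
        cont = Inode(chunk=chunk, ino=next_ino,
                     parent=(primary.chunk, primary.ino))
        primary.continuations[chunk] = cont
    cont.blocks.append(block)
    return cont


foo = Inode(chunk="A", ino=10)
add_block(foo, "A", 100, next_ino=0)          # data in the home chunk
cont = add_block(foo, "B", 200, next_ino=7)   # data in chunk B
print(cont.parent)
```

The point of the back pointer is that a checker working on chunk B alone
can tell which file the continuation inode belongs to without walking
chunk A.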
> One interesting possibility would be to combine this with the ext2
> dirty bit patches.
Put on your asbestos vest before suggesting any changes to ext2 :-).
> If we implement chunkfs on top of this, we could get away with fsck'ing
> only a few of the file systems each time, getting ext2-style performance
> with ext3-style fast recovery.
While fast recovery is one aspect of ext3 journaling, the other one
is that this allows multiple filesystem changes to be made atomically
and they are rolled back as a set if the system crashes in the middle.
> We'll talk about it more next week, I hope.
I look forward to it.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
2006-06-10 5:25 ` Andreas Dilger
@ 2006-06-10 5:41 ` Valerie Henson
2006-06-10 6:22 ` Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Valerie Henson @ 2006-06-10 5:41 UTC (permalink / raw)
To: Matthew Wilcox, Alex Tomas, Andrew Morton, Jeff Garzik,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
Arjan van de Ven
On Fri, Jun 09, 2006 at 11:25:02PM -0600, Andreas Dilger wrote:
>
> The 48-bit support was actually only a small part of the original reason for
> extents, while it seems to be the most popular right now.
*nod*
> This needs some extra data in the directory entry, which I've already
> been thinking about for ext3, so if you are looking at implementing
> this for ext3 I'd be happy to share some ideas.
Actually, it seems vaguely possible this could be implemented as a
layer on top of any normal file system - just use files to store
continuation inodes and the like. Then you could use the file system
that best suits your workload underneath. (Suparna has a paper in the
next OLS talking about something related but not identical, check it
out.) Most likely it would be criminally wasteful of space and really
slow, but it's something to think about.
> > One interesting possibility would be to combine this with the ext2
> > dirty bit patches.
>
> Put on your asbestos vest before suggesting any changes to ext2 :-).
*laugh* What about ext2.5? :) Seriously, ext2 needs to be left alone,
but I'm open to the possibility that any of the existing file system
code bases could be forked off into a development file system. Some
ideas would be more compatible with some code bases than others, and
forking might get rid of some constraints - e.g., an XFS fork could
get rid of a lot of crufty compat code.
-VAL
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
2006-06-10 5:41 ` Valerie Henson
@ 2006-06-10 6:22 ` Andreas Dilger
0 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 6:22 UTC (permalink / raw)
To: Valerie Henson
Cc: Andrew Morton, Jeff Garzik, Matthew Wilcox, Arjan van de Ven,
ext2-devel, linux-kernel, Linus Torvalds, cmm, linux-fsdevel,
Alex Tomas
On Jun 09, 2006 22:41 -0700, Valerie Henson wrote:
> On Fri, Jun 09, 2006 at 11:25:02PM -0600, Andreas Dilger wrote:
> > This needs some extra data in the directory entry, which I've already
> > been thinking about for ext3, so if you are looking at implementing
> > this for ext3 I'd be happy to share some ideas.
>
> Actually, it seems vaguely possible this could be implemented as a
> layer on top of any normal file system - just use files to store
> continuation inodes and the like. Then you could use the file system
> that best suits your workload underneath.
That is basically Lustre. One filesystem (the metadata filesystem, MDS)
holds just the pathnames and some EA data that points to other files
(these are essentially "file continuation inodes"). The data filesystems
(object storage filesystems, OST) have the file data RAID0 striped over
multiple OST "objects". The objects are just regular files stored in
ext3 filesystems.
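The RAID0 striping mentioned above boils down to a simple offset mapping.
The following is a generic round-robin striping sketch, not Lustre's
actual code; the function name, the 1 MiB stripe size and the stripe count
of 4 are assumptions chosen for the example.

```python
def locate(offset: int, stripe_size: int, stripe_count: int):
    """Map a file offset to (object index, offset within that object)."""
    stripe_no = offset // stripe_size            # which stripe unit overall
    obj_index = stripe_no % stripe_count         # round-robin over objects
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


MIB = 1 << 20
print(locate(5 * MIB + 10, stripe_size=MIB, stripe_count=4))
```

With four objects, consecutive 1 MiB units of the file land on objects
0, 1, 2, 3, 0, 1, and so on, so each object stays a plain, densely packed
regular file on its own filesystem.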
In clustered metadata Lustre (CMD) there are also continuation inodes for
files in a single directory, but currently a 2TB MDS filesystem is plenty
big for holding just filenames and inodes.
The same problems exist with Lustre that you have to face with the
continuation inode scheme - files that grow too large for a single
chunk, cross-chunk namespace links, etc.
Of course we'd be thrilled if there was a desire to implement Lustre
at a completely local-filesystem level (removing a lot of the networking
and required recovery mechanism), though it would also be desirable to
have the ability to move a filesystem from a local box to a distributed
filesystem (ala X11) without any changes.
> (Suparna has a paper in the next OLS talking about something related
> but not identical, check it out.)
Interesting, I'll have to take a look.
> forking might get rid of some constraints - e.g., an XFS fork could
> get rid of a lot of crufty compat code.
It continually amazes me that XFS even made it into the kernel as it
currently stands, because of the normally vehement objections to any
kind of abstraction of code.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3)
2006-06-10 3:26 ` Continuation Inodes Explained! (was Re: [RFC 0/13] extents and 48bit ext3) Valerie Henson
2006-06-10 5:25 ` Andreas Dilger
@ 2006-06-10 14:22 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:22 UTC (permalink / raw)
To: Valerie Henson
Cc: Andrew Morton, Matthew Wilcox, Arjan van de Ven, ext2-devel,
linux-kernel, Linus Torvalds, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
Valerie Henson wrote:
> So what the heck are continuation inodes? Actually, we named this
> "chunkfs" - not particularly descriptive, maybe continuation inodes is
> a better term.
[...]
> The basic idea is to create a bunch of small file systems - chunks -
> which look like one big file system to the administrator. Major
Back when I was still playing with my experimental filesystem, one of
the short-list features I was planning on implementing was the
allocation of both metadata and data from the same underlying data
store, essentially collections of "buckets" for data.
The data store would be a succession of progressively-smaller buckets.
Typical bucket sizes (chosen by admin) on a single filesystem might be:
1G, 128M, 4M, 1M, 64k, 4k. The largest (top-most) bucket is the
fundamental unit of allocation for the filesystem, from which all other
metadata and data is read/allocated.
So in my example above, the 1G bucket is analogous to a single chunk in
chunkfs, and any number of 1G buckets -- from any number of block
devices -- may comprise a single filesystem.
New inode tables, bitmap chunks, directories, large files, etc. are all
allocated from an "appropriate" bucket. IMO this type of solution
provides fsck-friendly isolation, and adds sufficient flexibility for
doing things like delayed alloc, metadata-is-a-file, etc.
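To make "appropriate bucket" concrete, here is one possible policy,
sketched with the bucket sizes listed above. The policy itself (pick the
smallest bucket size that fits the request) is an assumption for the sake
of the example, and pick_bucket is a hypothetical name.

```python
import bisect

K, M, G = 1 << 10, 1 << 20, 1 << 30
# Bucket sizes from the example above, kept sorted ascending for bisect.
BUCKET_SIZES = sorted([1 * G, 128 * M, 4 * M, 1 * M, 64 * K, 4 * K])


def pick_bucket(nbytes: int) -> int:
    """Return the smallest bucket size that can hold `nbytes`."""
    if nbytes <= 0:
        raise ValueError("allocation size must be positive")
    i = bisect.bisect_left(BUCKET_SIZES, nbytes)
    if i == len(BUCKET_SIZES):
        # Larger than the top-most bucket: the caller would have to span
        # several top-level buckets, so hand back the largest size.
        return BUCKET_SIZES[-1]
    return BUCKET_SIZES[i]


print(pick_bucket(2 * M) // M)
```

A 2M request lands in a 4M bucket under this policy, while anything over
1G is spread across several top-level buckets.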
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:28 ` [Ext2-devel] " Alex Tomas
2006-06-09 15:31 ` Matthew Wilcox
@ 2006-06-09 15:44 ` Jeff Garzik
2006-06-09 15:53 ` Alex Tomas
2006-06-09 18:29 ` Andreas Dilger
2006-06-09 15:53 ` [Ext2-devel] " Gerrit Huizenga
2 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:44 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel
Alex Tomas wrote:
> JG> "ext3" will become more and more meaningless. It could mean _any_ of
> JG> several filesystem metadata variants, and the admin will have no clue
> JG> which variant they are talking to until they try to mount the blkdev
> JG> (and possibly fail the mount).
>
> debugfs <dev> -R stats | grep features ?
The question is, do you
a) expect users to run this magic command, and DTRT or
b) watch users boot w/ extents, accidentally do something silly like
writing data to a file, and become locked into a new subset of kernels?
The simple act of writing data to a file has become an _irrevocable
filesystem upgrade event_.
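The lock-in mechanism can be modelled in a few lines. This is a toy model
and not kernel code: it assumes the usual ext2/ext3 rule that a kernel
refuses to mount a filesystem whose superblock carries an INCOMPAT bit it
does not know, and that the extents bit gets set once the first
extent-mapped file is written. The flag values are for illustration only.

```python
# Illustrative INCOMPAT feature bits (values assumed for the example).
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004
INCOMPAT_EXTENTS = 0x0040


def can_mount(sb_incompat: int, kernel_supported: int) -> bool:
    """A kernel may mount only if it understands every INCOMPAT bit set."""
    return (sb_incompat & ~kernel_supported) == 0


old_kernel = INCOMPAT_FILETYPE | INCOMPAT_RECOVER
new_kernel = old_kernel | INCOMPAT_EXTENTS

sb = INCOMPAT_FILETYPE                    # filesystem as created
before = (can_mount(sb, old_kernel), can_mount(sb, new_kernel))

sb |= INCOMPAT_EXTENTS                    # first extent-mapped write
after = (can_mount(sb, old_kernel), can_mount(sb, new_kernel))
print(before, after)
```

Before the write both kernels can mount the filesystem; afterwards only
the new one can, and nothing in this model ever clears the bit again.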
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:44 ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
@ 2006-06-09 15:53 ` Alex Tomas
2006-06-09 15:52 ` Jeff Garzik
2006-06-09 18:29 ` Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 15:53 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> Alex Tomas wrote:
JG> "ext3" will become more and more meaningless. It could mean
>> _any_ of JG> several filesystem metadata variants, and the admin
>> will have no clue JG> which variant they are talking to until they
>> try to mount the blkdev JG> (and possibly fail the mount).
>> debugfs <dev> -R stats | grep features ?
JG> The question is, do you
JG> a) expect users to run this magic command, and DTRT or
JG> b) watch users boot w/ extents, accidentally do something silly like
JG> writing data to a file, and become locked into a new subset of kernels?
at the moment there is no way to "boot w/ extents". you must enable
them by mount option.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:53 ` Alex Tomas
@ 2006-06-09 15:52 ` Jeff Garzik
2006-06-09 16:02 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:52 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Andreas Dilger
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Alex Tomas wrote:
> JG> "ext3" will become more and more meaningless. It could mean
> >> _any_ of JG> several filesystem metadata variants, and the admin
> >> will have no clue JG> which variant they are talking to until they
> >> try to mount the blkdev JG> (and possibly fail the mount).
> >> debugfs <dev> -R stats | grep features ?
>
> JG> The question is, do you
>
> JG> a) expect users to run this magic command, and DTRT or
>
> JG> b) watch users boot w/ extents, accidentally do something silly like
> JG> writing data to a file, and become locked into a new subset of kernels?
>
> at the moment there is no way to "boot w/ extents". you must enable
> them by mount option.
Think about how distros will deploy this feature. Also, think about how
scalable that line of thinking is...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:52 ` Jeff Garzik
@ 2006-06-09 16:02 ` Alex Tomas
2006-06-09 16:04 ` [Ext2-devel] " Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:02 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
>>>>> Jeff Garzik (JG) writes:
JG> Alex Tomas wrote:
>> at the moment there is no way to "boot w/ extents". you must enable
>> them by mount option.
JG> Think about how distros will deploy this feature. Also, think about
JG> how scalable that line of thinking is...
I may be wrong, but I tend to think if they're stupid enough to enable
experimental mount option by default, they can do s/ext3/ext4 as well.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:02 ` Alex Tomas
@ 2006-06-09 16:04 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:04 UTC (permalink / raw)
To: Alex Tomas
Cc: Andreas Dilger, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Alex Tomas wrote:
> >> at the moment there is no way to "boot w/ extents". you must enable
> >> them by mount option.
>
> JG> Think about how distros will deploy this feature. Also, think about
> JG> how scalable that line of thinking is...
>
> I may be wrong, but I tend to think if they're stupid enough to enable
> experimental mount option by default, they can do s/ext3/ext4 as well.
<sigh> At some point in the future, it will not be experimental.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:44 ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
2006-06-09 15:53 ` Alex Tomas
@ 2006-06-09 18:29 ` Andreas Dilger
1 sibling, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 18:29 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Linus Torvalds, cmm,
linux-fsdevel, Alex Tomas
On Jun 09, 2006 11:44 -0400, Jeff Garzik wrote:
> b) watch users boot w/ extents, accidentally do something silly like
> writing data to a file, and become locked into a new subset of kernels?
>
> The simple act of writing data to a file has become an _irrevocable
> filesystem upgrade event_.
You keep on saying this, but you know it won't happen TODAY. On the contrary,
if extents are merged today, I don't see distros making it a default mount
option for YEARS (it won't be the default for RHEL5, which is the only distro
that has participation on the ext3 developers, I can't comment for others).
WHEN extents become the default (which I hope they will at some point, like
dir_index and large inodes, that have been around for years already too)
then it will be mostly a non-issue (how many times do you boot into 2.2?).
The only exception is if you have a filesystem larger than 16TB you have
to use extents, which isn't an issue either way. I don't think they will
ever become the default for e.g. root or boot filesystems, just for
compatibility reasons, but are highly desirable for e.g. mythtv or other
"large file" using filesystems.
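The size limits being discussed follow from the width of the block number
and the block size. A quick sketch of the arithmetic, assuming 4 KiB
blocks; the reading that the 8TB figure in the cover letter corresponds to
31 usable bits is an assumption on my part.

```python
TIB = 1 << 40
PIB = 1 << 50


def max_fs_bytes(block_bits: int, block_size: int = 4096) -> int:
    """Largest filesystem addressable with `block_bits`-bit block numbers."""
    return (1 << block_bits) * block_size


for bits in (31, 32, 48):
    print(bits, max_fs_bytes(bits) // TIB, "TiB")
```

With 32-bit block numbers this gives the 16TB limit mentioned above, with
31 bits the 8TB figure from the cover letter, and with 48 bits 1024 PB.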
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:28 ` [Ext2-devel] " Alex Tomas
2006-06-09 15:31 ` Matthew Wilcox
2006-06-09 15:44 ` [Ext2-devel] [RFC 0/13] extents and 48bit ext3 Jeff Garzik
@ 2006-06-09 15:53 ` Gerrit Huizenga
2006-06-09 16:03 ` Jeff Garzik
2006-06-09 16:09 ` Linus Torvalds
2 siblings, 2 replies; 295+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 15:53 UTC (permalink / raw)
To: Alex Tomas
Cc: Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
On Fri, 09 Jun 2006 19:28:22 +0400, Alex Tomas wrote:
> JG> "ext3" will become more and more meaningless. It could mean _any_ of
> JG> several filesystem metadata variants, and the admin will have no clue
> JG> which variant they are talking to until they try to mount the blkdev
> JG> (and possibly fail the mount).
>
> debugfs <dev> -R stats | grep features ?
Sounds similar to cat /proc/cpuinfo. How *do* we deal with processors
which have all these many different features? Probably better than we
would if each variant were viewed as a different architecture.
Jeff's approach taken to the ridiculous would mean that we'd have
ext versions 1-40 by now at least. I don't think that helps much,
either.
I think the ext2/3 team has done a great job of providing compatibility.
It isn't perfect compatibility forwards *and* backwards, but moving
forwards always seems to be pretty reasonable.
gerrit
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:53 ` [Ext2-devel] " Gerrit Huizenga
@ 2006-06-09 16:03 ` Jeff Garzik
2006-06-09 16:09 ` Linus Torvalds
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:03 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Alex Tomas, Andrew Morton, ext2-devel, linux-kernel,
Linus Torvalds, cmm, linux-fsdevel, Andreas Dilger
Gerrit Huizenga wrote:
> Jeff's approach taken to the ridiculous would mean that we'd have
> ext versions 1-40 by now at least. I don't think that helps much,
> either.
That's plainly silly. Like everything else in life, it is a balance of
costs.
At some point, ext3's fs-feature-flag approach increases the
combinations of metadata variants you must support exponentially.
Moving to extents and 48bit (which I want) is a big enough step that,
IMO, some of the support costs become far more obvious.
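To put a number on "exponentially": if feature flags can be toggled
independently, n flags give up to 2^n on-disk variants. A small
enumeration, using four features named in this thread as the example set.
Treating them as fully independent is an assumption and gives an upper
bound, since some combinations may never occur in practice.

```python
from itertools import combinations


def variants(features):
    """Enumerate every subset of independently switchable features."""
    return [set(c)
            for r in range(len(features) + 1)
            for c in combinations(features, r)]


flags = ["dir_index", "large_inodes", "extents", "48bit"]
print(len(variants(flags)))
```

Each additional independent flag doubles the count, so a fifth flag takes
the upper bound from 16 to 32 variants to test and support.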
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:53 ` [Ext2-devel] " Gerrit Huizenga
2006-06-09 16:03 ` Jeff Garzik
@ 2006-06-09 16:09 ` Linus Torvalds
2006-06-09 17:58 ` Gerrit Huizenga
1 sibling, 1 reply; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 16:09 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Alex Tomas, Jeff Garzik, Andrew Morton, ext2-devel, linux-kernel,
cmm, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
>
> Jeff's approach taken to the ridiculous would mean that we'd have
> ext versions 1-40 by now at least. I don't think that helps much,
> either.
On the other hand, I _guarantee_ you that it helps that we have ext2-3,
and not just ext2 (nobody even tried to keep ext1 compatible, thank the
Gods).
If for no other reason, than the fact that the ext3 development could be
much more aggressive early on. Exactly because it did NOT screw up the old
filesystem that everybody else depended on.
So we have empirical evidence that splitting filesystem work up does
actually help.
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:09 ` Linus Torvalds
@ 2006-06-09 17:58 ` Gerrit Huizenga
2006-06-09 18:25 ` [Ext2-devel] " Chase Venters
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 17:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, cmm,
linux-fsdevel, Alex Tomas, Andreas Dilger
On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
> On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> >
> > Jeff's approach taken to the ridiculous would mean that we'd have
> > ext versions 1-40 by now at least. I don't think that helps much,
> > either.
>
> On the other hand, I _guarantee_ you that it helps that we have ext2-3,
> and not just ext2 (nobody even tried to keep ext1 compatible, thank the
> Gods).
I had originally argued for ext4 as well based on the fact that it would
allow lots of potential cleanups & simplifications and at the same time
would allow a break in the on disk filesystems layout.
These changes don't yet change the actual on-disk layout and that might
be something that would be done if ext4 were a real, new filesystem.
But then how long until ext4 is used enough to be put into production?
How much testing will it *really* get in any form? How long before
the people that are using 100 TB+ disk farms today (some of which are
chopping filesystems into 2-8 GB chunks, others with 2 TB filesystems
today) actually trust this new filesystem (most vendors don't support
JFS today, XFS support isn't much better).
We are seeing storage needs increasing at a frightening rate. Health
Care folks want to store your MRI's, x-ray's, ultrasounds, etc. in high
res digital format across your entire life in near-line format. Terabytes
over time per person. Europe is already doing this pretty extensively,
the US is following suit. Digital media creation has huge storage needs.
Most everything is moving to podcasts, webcasts, streaming audio & video.
Storage is huge, and ext3 is at the current breaking point.
I'd argue that whatever we call it, we need a standard, stable, supported
solution *soon* for large files, large filesystems, large storage systems
in Linux.
I'd think the quickest path is to relieve the pressure now in ext3.
We still haven't solved the filesystem check time problem, which is the
next big bugaboo. But getting large filesystems to real customers soon,
e.g. in mainline, well tested, ready for distro support is my real goal.
> If for no other reason, than the fact that the ext3 development could be
> much more aggressive early on. Exactly because it did NOT screw up the old
> filesystem that everybody else depended on.
Yes, but we want aggressive with robustness for real users soon. Lots
of crazy ext4 development could become technical wanking in no time, with
no point of stability, and no general usefulness in the short term.
> So we have empirical evidence that splitting filesystem work up does
> actually help.
Agreed. But... Maybe that should be the set of changes *following*
extents. Then the file format can change, several of the pending ideas
can be worked in, and some of the backwards compatibility can be cleaned
out if it is in the way. Then the extents work can get us something
usable in all the interim distro releases for the real users who are
screaming now about the filesystem size limits.
gerrit
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:58 ` Gerrit Huizenga
@ 2006-06-09 18:25 ` Chase Venters
2006-06-10 13:46 ` Adrian Bunk
2006-06-13 13:34 ` [Ext2-devel] " Helge Hafting
2 siblings, 0 replies; 295+ messages in thread
From: Chase Venters @ 2006-06-09 18:25 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> We are seeing storage needs increasing at a frightening rate. Health
> Care folks want to store your MRI's, x-ray's, ultrasounds, etc. in high
> res digital format across your entire life in near-line format. Terabytes
> over time per person. Europe is already doing this pretty extensively,
> the US is following suit. Digital media creation has huge storage needs.
> Most everything is moving to podcasts, webcasts, streaming audio & video.
> Storage is huge, and ext3 is at the current breaking point.
>
> I'd argue that whatever we call it, we need a standard, stable, supported
> solution *soon* for large files, large filesystems, large storage systems
> in Linux.
>
> I'd think the quickest path is to relieve the pressure now in ext3.
Makes sense...
>> So we have empirical evidence that splitting filesystem work up does
>> actually help.
>
> Agreed. But... Maybe that should be the set of changes *following*
> extents. Then the file format can change, several of the pending ideas
> can be worked in, and some of the backwards compatibility can be cleaned
> out if it is in the way. Then the extents work can get us something
> usable in all the interim distro releases for the real users who are
> screaming now about the filesystem size limits.
Let's call ext3 "Linux 2.4" for a second and ext(x) w/extents and 48-bit
"Linux 2.5". We can now do all the crazy, wild work we want on 2.5, but
people need it tomorrow. And they can have it, but we're stamping
"Dangerous! Dangerous! Unstable! API changes every 5 minutes, your data
will be obsoleted each release!" all over it. This goes on for years until
we finally reach a point where we can roll out "Linux 2.6".
The trouble is that "Linux 2.6" is something many of us are going to be
wanting _now_.
Now, taking the quotes back off "Linux 2.6" and speaking about the kernel
as a whole again - isn't lots of incremental stable releases with new
functionality something that cutting off the development arm made
possible?
I acknowledge the concerns about filesystem stability and Linus's points
about improperly shared code. From a practical standpoint, I see the need
of bigger filesystems coming.
And the biggest practical problem I see is one of perception. Making
'ext4' means labelling it unstable for a while. And once something like a
_filesystem_ is called unstable, it's going to be a long time before
people trust it with terabytes of their incredibly valuable data (even if
we promise them that it's mostly an ext3 fork).
Whereas if you play with some experimental 48-bit extension on ext3, well,
ext3 already has a good reputation and is in use everywhere, so maybe this
isn't a bad "last feature" to add before forking off into ext4-land?
> gerrit
Cheers,
Chase
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:58 ` Gerrit Huizenga
2006-06-09 18:25 ` [Ext2-devel] " Chase Venters
@ 2006-06-10 13:46 ` Adrian Bunk
2006-06-10 14:42 ` Ingo Molnar
2006-06-13 13:34 ` [Ext2-devel] " Helge Hafting
2 siblings, 1 reply; 295+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:46 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger
On Fri, Jun 09, 2006 at 10:58:00AM -0700, Gerrit Huizenga wrote:
>
> On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
> > On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
> > >
> > > Jeff's approach taken to the ridiculous would mean that we'd have
> > > ext versions 1-40 by now at least. I don't think that helps much,
> > > either.
> >
> > On the other hand, I _guarantee_ you that it helps that we have ext2-3,
> > and not just ext2 (nobody even tried to keep ext1 compatible, thank the
> > Gods).
>
> I had originally argued for ext4 as well based on the fact that it would
> allow lots of potential cleanups & simplifications and at the same time
> would allow a break in the on disk filesystems layout.
>
> These changes don't yet change the actual on-disk layout and that might
> be something that would be done if ext4 were a real, new filesystem.
>
> But then how long until ext4 is used enough to be put into production?
> How much testing will it *really* get in any form? How long before
> the people that are using 100 TB+ disk farms today (some of which are
> chopping filesystems into 2-8 GB chunks, others with 2 TB filesystems
> today) actually trust this new filesystem (most vendors don't support
> JFS today, XFS support isn't much better).
You want to get the new features into ext3 instead of creating ext4 for
getting them better tested.
Other people in this thread want to get the new features into ext3
instead of creating ext4 telling that this won't do harm for existing
users since users will have to explicitly enable it.
Hearing people using contrary arguments in the same discussion always
sounds as if they don't actually know what they want to do...
> We are seeing storage needs increasing at a frightening rate. Health
> Care folks want to store your MRI's, x-ray's, ultrasounds, etc. in high
> res digital format across your entire life in near-line format. Terabytes
> over time per person. Europe is already doing this pretty extensively,
> the US is following suit. Digital media creation has huge storage needs.
> Most everything is moving to podcasts, webcasts, streaming audio & video.
> Storage is huge, and ext3 is at the current breaking point.
>
> I'd argue that whatever we call it, we need a standard, stable, supported
> solution *soon* for large files, large filesystems, large storage systems
> in Linux.
>
> I'd think the quickest path is to relieve the pressure now in ext3.
Why aren't JFS and XFS good enough for relieving the pressure now?
> We still haven't solved the filesystem check time problem, which is the
> next big bugaboo. But getting large filesystems to real customers soon,
> e.g. in mainline, well tested, ready for distro support is my real goal.
>...
Other people have the "no regressions for existing ext3 users" goal.
> gerrit
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 13:46 ` Adrian Bunk
@ 2006-06-10 14:42 ` Ingo Molnar
2006-06-10 15:03 ` Jeff Garzik
` (3 more replies)
0 siblings, 4 replies; 295+ messages in thread
From: Ingo Molnar @ 2006-06-10 14:42 UTC (permalink / raw)
To: Adrian Bunk
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
* Adrian Bunk <bunk@stusta.de> wrote:
> > I'd argue that whatever we call it, we need a standard, stable,
> > supported solution *soon* for large files, large filesystems, large
> > storage systems in Linux.
> >
> > I'd think the quickest path is to relieve the pressure now in ext3.
>
> Why aren't JFS and XFS good enough for relieving the pressure now?
Compatibility? Upgradability? Simplicity? Supportability?
Even ignoring all those arguments, i find your "ext3/ext4 is too
complex, use XFS or JFS" argument a bit naive. Please take a quick look
at the linecount of the filesystems in question:
LOC
------------------
ext2: 7492
ext3+jbd: 22197
ext4+jbd: 24312
reiser3: 28857
reiser4: 79189
JFS: 32819
XFS: 110718
the ext3 -> ext4 patches add +2115 lines of code (which 2115 lines solve
the biggest performance and scaling problem ext3 currently has), which
is 1.9% of the linecount of XFS.
Q.E.D.
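The two derived figures can be rechecked from the table. This is just the
arithmetic on the line counts quoted above; nothing here was measured
independently.

```python
# Line counts copied from the table above.
LOC = {
    "ext2": 7492,
    "ext3+jbd": 22197,
    "ext4+jbd": 24312,
    "reiser3": 28857,
    "reiser4": 79189,
    "JFS": 32819,
    "XFS": 110718,
}

added = LOC["ext4+jbd"] - LOC["ext3+jbd"]     # lines added by the patches
pct_of_xfs = 100 * added / LOC["XFS"]
print(added, round(pct_of_xfs, 1))
```

Both numbers agree with the ones given: 2115 lines added, about 1.9% of
the XFS line count.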
> > We still haven't solved the filesystem check time problem, which is the
> > next big bugaboo. But getting large filesystems to real customers soon,
> > e.g. in mainline, well tested, ready for distro support is my real goal.
> >...
>
> Other people have the "no regressions for existing ext3 users" goal.
frankly, i'll leave that decision to the ext3 developers and obviously,
to distributors. Their filesystem has handled my data for 10 years, and
they have been very conservative about their technical choices
throughout. I trust them to not mess up this time either.
ext3 does quite a few things to stay compatible with ext2 - and frankly,
i very much expected it to do that when i migrated my ext2 data to ext3.
The days of "change the world in an incompatible way and dont look back"
are gone.
Ingo
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 14:42 ` Ingo Molnar
@ 2006-06-10 15:03 ` Jeff Garzik
2006-06-11 6:00 ` Ingo Molnar
2006-06-10 16:00 ` Adrian Bunk
` (2 subsequent siblings)
3 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 15:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, ext2-devel, linux-kernel, Adrian Bunk,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
Ingo Molnar wrote:
> the ext3 -> ext4 patches add +2115 lines of code (which 2115 lines solve
> the biggest performance and scaling problem ext3 currently has), which
> is 1.9% of the linecount of XFS.
Indeed!
> ext3 does quite a few things to stay compatible with ext2 - and frankly,
> i very much expected it to do that when i migrated my ext2 data to ext3.
> The days of "change the world in an incompatible way and dont look back"
> are gone.
I agree with your point in the thread -- most users and distros don't
change their main fs on a whim. But I also point out that these
extent+48bit changes _do_ change the format in an incompatible way...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 15:03 ` Jeff Garzik
@ 2006-06-11 6:00 ` Ingo Molnar
0 siblings, 0 replies; 295+ messages in thread
From: Ingo Molnar @ 2006-06-11 6:00 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Adrian Bunk,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
* Jeff Garzik <jeff@garzik.org> wrote:
> I agree with your point in the thread -- most users and distros don't
> change their main fs on a whim. But I also point out that these
> extent+48bit changes _do_ change the format in an incompatible way...
yeah. /me learns to not post too much while watching football ;)
Ingo
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 14:42 ` Ingo Molnar
2006-06-10 15:03 ` Jeff Garzik
@ 2006-06-10 16:00 ` Adrian Bunk
2006-06-10 16:05 ` Christoph Hellwig
2006-06-10 23:05 ` Mike Galbraith
3 siblings, 0 replies; 295+ messages in thread
From: Adrian Bunk @ 2006-06-10 16:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
On Sat, Jun 10, 2006 at 04:42:28PM +0200, Ingo Molnar wrote:
>
> * Adrian Bunk <bunk@stusta.de> wrote:
>
> > > I'd argue that whatever we call it, we need a standard, stable,
> > > supported solution *soon* for large files, large filesystems, large
> > > storage systems in Linux.
> > >
> > > I'd think the quickest path is to relieve the pressure now in ext3.
> >
> > Why aren't JFS and XFS good enough for relieving the pressure now?
>
> Compatibility? Upgradability? Simplicity? Supportability?
>
> Even ignoring all those arguments, i find your "ext3/ext4 is too
> complex, use XFS or JFS" argument a bit naive. Please take a quick look
> at the linecount of the filesystems in question:
>...
You missed my point (or I didn't make it clear enough):
It's no question that an improved version of ext3 will be available.
The only question is whether it will be ext3 or ext4.
My point was that if it takes a bit longer in the ext4 case, and during
this time some people have this pressure of requiring it, they have the
workaround of using other file systems.
Whether the "improve ext3" or the ext4 approach is better is a
different question. Whether ext3 is better than XFS is also not what I
was talking about.
It's simply that for the few people who need it now, other file systems
are available as a workaround.
> Ingo
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 14:42 ` Ingo Molnar
2006-06-10 15:03 ` Jeff Garzik
2006-06-10 16:00 ` Adrian Bunk
@ 2006-06-10 16:05 ` Christoph Hellwig
2006-06-10 23:05 ` Mike Galbraith
3 siblings, 0 replies; 295+ messages in thread
From: Christoph Hellwig @ 2006-06-10 16:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, Adrian Bunk,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
On Sat, Jun 10, 2006 at 04:42:28PM +0200, Ingo Molnar wrote:
> Even ignoring all those arguments, i find your "ext3/ext4 is too
> complex, use XFS or JFS" argument a bit naive. Please take a quick look
> at the linecount of the filesystems in question:
That isn't interesting at all. There are a lot more interesting features
in jfs and xfs. XFS is still quite bloated even compared to its features,
but it's doing much more than just an ext3+extents. At a smaller scale
that's true for jfs as well.
As mentioned a few times below, just getting over the 8TB barrier is far
from enough for the next-gen linux filesystems. XFS already goes on to
address the Petabyte barrier. It's not like it couldn't address Petabytes
of storage from the very beginning, but you have such problems as needing
a parallel fsck, fault tolerance, lots of parallelism in the filesystem
and things like delayed allocations to hit linerate on dozens of FC HBAs
in the system.
And no, I don't want to bitch about ext3, it's doing good work for me
on most of my machines, but it's definitely not what I would want to
scale to really large filesystems. It's UFS done right, but the time
of UFS derivatives is slowly passing.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 14:42 ` Ingo Molnar
` (2 preceding siblings ...)
2006-06-10 16:05 ` Christoph Hellwig
@ 2006-06-10 23:05 ` Mike Galbraith
3 siblings, 0 replies; 295+ messages in thread
From: Mike Galbraith @ 2006-06-10 23:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, Adrian Bunk,
Linus Torvalds, Gerrit Huizenga, cmm, linux-fsdevel, Alex Tomas,
Andreas Dilger
On Sat, 2006-06-10 at 16:42 +0200, Ingo Molnar wrote:
> frankly, i'll leave that decision to the ext3 developers and obviously,
> to distributors. Their filesystem has handled my data for 10 years, and
> they have been very conservative about their technical choices
> throughout. I trust them to not mess up this time either.
That's my view in a nutshell (minus distributors). Add caps.
-Mike
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:58 ` Gerrit Huizenga
2006-06-09 18:25 ` [Ext2-devel] " Chase Venters
2006-06-10 13:46 ` Adrian Bunk
@ 2006-06-13 13:34 ` Helge Hafting
2 siblings, 0 replies; 295+ messages in thread
From: Helge Hafting @ 2006-06-13 13:34 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Linus Torvalds, Alex Tomas, Jeff Garzik, Andrew Morton,
ext2-devel, linux-kernel, cmm, linux-fsdevel, Andreas Dilger
Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 09:09:01 PDT, Linus Torvalds wrote:
>
>> On Fri, 9 Jun 2006, Gerrit Huizenga wrote:
>>
>>> Jeff's approach taken to the ridiculous would mean that we'd have
>>> ext versions 1-40 by now at least. I don't think that helps much,
>>> either.
>>>
>> On the other hand, I _guarantee_ you that it helps that we have ext2-3,
>> and not just ext2 (nobody even tried to keep ext1 compatible, thank the
>> Gods).
>>
>
> I had originally argued for ext4 as well based on the fact that it would
> allow lots of potential cleanups & simplifications and at the same time
> would allow a break in the on disk filesystems layout.
>
> These changes don't yet change the actual on-disk layout and that might
> be something that would be done if ext4 were a real, new filesystem.
>
> But then how long until ext4 is used enough to be put into production?
>
No problem. It didn't take long for ext3 - it won't take long for ext4.
First, you have developers and some enthusiasts using it.
Then, you get the thousands of people who like living
on the edge using ext4. As soon as it doesn't have bad known bugs.
Then some distros pick it up, wanting to be first with
large-disk support.
After that, it is considered "harmless".
If a break in on-disk layout is useful, then the time is now while
a new fs is introduced anyway. It could be 7 years to the next chance.
Helge Hafting
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:08 ` Jeff Garzik
2006-06-09 15:25 ` Jeff Garzik
2006-06-09 15:28 ` [Ext2-devel] " Alex Tomas
@ 2006-06-09 20:32 ` Stephen C. Tweedie
2006-06-09 20:46 ` Linus Torvalds
2 siblings, 1 reply; 295+ messages in thread
From: Stephen C. Tweedie @ 2006-06-09 20:32 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Stephen Tweedie, ext2-devel@lists.sourceforge.net,
linux-kernel, Linus Torvalds, Mingming Cao, linux-fsdevel,
Andreas Dilger
Hi,
On Fri, 2006-06-09 at 11:08 -0400, Jeff Garzik wrote:
> Stuffing more and more features into fs/ext3 means you are following the
> path that leads to reiser4... where EVERYTHING under the hood is
> mutable, all within fs/ext3.
> Why do you insist upon calling the end result ext3, when the truth is
> that you are slowly rewriting ext3?
The trouble is, does it make sense to do otherwise?
Should large file support have resulted in ext4? ACLs/xattrs, ext5?
Htree, ext6? Online resize, ext7? Yes, let's make it ext8 for extents!
> Here's a key question for ext3 developers, which I bet has no answer:
> when is it enough?
When is the Linux syscall interface enough? When should we just bump it
and cut out all the compatibility interfaces?
No, we don't; we let people configure certain obsolete bits out (a.out
support etc), but we keep it in the tree despite the indirection cost to
maintain multiple interfaces etc.
> > While this is partly true, one of the big benefits is that you can
> > transparently upgrade your system to use the new features and improve
> > performance without a long outage window. Having a completely separate
>
> Changing the name to ext4 doesn't erase this capability.
The name is irrelevant here. FWIW, something we've considered is to
make the user visibility of a batch of new features more obvious by
labelling them "ext4", so "mke4fs" would automatically enable those
features and the filesystem could register "ext4" as an fs type in the
kernel.
But that could be done without forking the codebase. It would just be a
matter of binding feature flag sets to the given name. What you're
talking about is forking the codebase itself, and I don't see the need
for that right now.
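To make the "binding feature flag sets to a name" idea concrete, here is a minimal sketch. It assumes the usual ext2/ext3 convention of three flag classes (compat, ro_compat, incompat), where an unknown incompat flag refuses the mount and an unknown ro_compat flag only permits a read-only mount. The profile table, the flag names "extents" and "48bit", and the function names are hypothetical, chosen for illustration; they are not the kernel's or e2fsprogs' actual identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSet:
    compat: frozenset = frozenset()      # unknown flags here are harmless
    ro_compat: frozenset = frozenset()   # unknown flags here allow read-only use
    incompat: frozenset = frozenset()    # unknown flags here forbid mounting


# Hypothetical binding of a user-visible name to a default set of flags.
# Both names would be served by one codebase; only the defaults differ.
PROFILES = {
    "ext3": FeatureSet(
        compat=frozenset({"has_journal"}),
        incompat=frozenset({"filetype"}),
    ),
    "ext4": FeatureSet(
        compat=frozenset({"has_journal"}),
        incompat=frozenset({"filetype", "extents", "48bit"}),
    ),
}


def mount_mode(on_disk: FeatureSet, supported: FeatureSet) -> str:
    """Return 'rw', 'ro' or 'refuse' for a filesystem with the given flags."""
    if not on_disk.incompat <= supported.incompat:
        return "refuse"
    if not on_disk.ro_compat <= supported.ro_compat:
        return "ro"
    return "rw"  # unknown compat flags are ignored by design


if __name__ == "__main__":
    old_kernel = PROFILES["ext3"]   # a kernel that predates the new flags
    new_kernel = PROFILES["ext4"]
    print(mount_mode(PROFILES["ext4"], old_kernel))
    print(mount_mode(PROFILES["ext3"], new_kernel))
```

In this sketch a kernel that knows the "ext4" flags still mounts an "ext3"-profile filesystem read-write, while an older kernel refuses a filesystem carrying the new incompat flags. That one-way compatibility is what the rest of the thread argues about.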
> > ext4 filesystem doesn't improve the compatibility story at all. There
> > has been renewed discussion on implementing "mounting ext3 without a
> > journal", just for a recovery mode, because ext2 will not be modified
> > to get all of these features (running e2fsck on a huge filesystem each
> > reboot would be insane).
>
> So now you are going backwards, and implementing ext2-within-ext3?
No, it would be a readonly emergency mode, not writable ext2 at all.
> Are you ready to admit, yet, that ext3 is 100% mutable in the minds of
> ext3 developers?
The kernel syscall interface is 100% mutable by the same criteria.
Except in each case it's not "mutable", it's "extensible", which is a
*far* different thing.
> If all the ext3 developers are on board, that just implies that there is
> no clear definition of what "ext3" really means. With this patch
> series, and with future plans described here and elsewhere, the name
> "ext3" will become more and more meaningless.
Does the continuing addition of futexes, inotify,
$FAVOURITE_FEATURE_OF_THE_DAY mean that "Linux" is more and more
meaningless? I fail to see much difference. An application coded for
linux-2.0's public interfaces
will, for the most part, if we do our jobs right, continue to work on
2.6. An application coded for 2.6, expecting to use AIO, large files,
futexes and NPTL, will definitely not run on 2.0. The incremental
extension of ext3 doesn't seem to be a fundamentally different concept.
Backwards compatibility of the kernel ABI is considered important; so in
ext3, the developers have a high regard for backwards compatibility of
on-disk data. Personally I see that as an asset, not a problem; indeed,
it was the single most important design criterion from the outset.
--Stephen
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:32 ` Stephen C. Tweedie
@ 2006-06-09 20:46 ` Linus Torvalds
2006-06-09 20:56 ` Alex Tomas
2006-06-20 6:15 ` [Ext2-devel] " Qi Yong
0 siblings, 2 replies; 295+ messages in thread
From: Linus Torvalds @ 2006-06-09 20:46 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Morton, Jeff Garzik, ext2-devel@lists.sourceforge.net,
linux-kernel, Mingming Cao, linux-fsdevel, Andreas Dilger
On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>
> When is the Linux syscall interface enough? When should we just bump it
> and cut out all the compatibility interfaces?
>
> No, we don't; we let people configure certain obsolete bits out (a.out
> support etc), but we keep it in the tree despite the indirection cost to
> maintain multiple interfaces etc.
Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT
MIGHT BREAK OLD USERS.
Your point was exactly what?
Btw, where did that 2TB limit number come from? Afaik, it should be 16TB
for a 4kB filesystem, no?
Linus
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:46 ` Linus Torvalds
@ 2006-06-09 20:56 ` Alex Tomas
2006-06-20 6:15 ` [Ext2-devel] " Qi Yong
1 sibling, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 20:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Jeff Garzik, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, Andreas Dilger
>>>>> Linus Torvalds (LT) writes:
LT> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
LT> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB
LT> for a 4kB filesystem, no?
2TB => 16K group descriptors * 8 (sizeof(void*) on 64bit arch) => 128K -- slab limit
we have a fix for this as well.
thanks, Alex
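The one-line derivation above can be spelled out. A minimal sketch, assuming 4 KiB blocks and the default ext2/ext3 group size of one bitmap block's worth of blocks (8 * block size); the message does not say which per-group array is meant, so the code only reproduces the arithmetic:

```python
BLOCK_SIZE = 4096                             # 4 KiB blocks (assumed)
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE             # one bitmap block, one bit per block
GROUP_BYTES = BLOCKS_PER_GROUP * BLOCK_SIZE   # bytes covered by one block group

FS_BYTES = 2 * 2**40                          # the 2TB figure from the message
POINTER_SIZE = 8                              # sizeof(void *) on a 64-bit arch

groups = FS_BYTES // GROUP_BYTES              # number of block groups
array_bytes = groups * POINTER_SIZE           # one pointer per group

print(f"group size:    {GROUP_BYTES // 2**20} MiB")
print(f"block groups:  {groups}")
print(f"pointer array: {array_bytes // 1024} KiB")
```

Under those assumptions a 2TB filesystem has 16K groups, and one pointer per group gives the 128K allocation that the message identifies as the slab limit.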
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:46 ` Linus Torvalds
2006-06-09 20:56 ` Alex Tomas
@ 2006-06-20 6:15 ` Qi Yong
2006-06-20 8:26 ` Laurent Vivier
1 sibling, 1 reply; 295+ messages in thread
From: Qi Yong @ 2006-06-20 6:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Stephen C. Tweedie, Jeff Garzik, Andreas Dilger, Andrew Morton,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, alex
Linus Torvalds wrote:
>On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>
>
>>When is the Linux syscall interface enough? When should we just bump it
>>and cut out all the compatibility interfaces?
>>
>>No, we don't; we let people configure certain obsolete bits out (a.out
>>support etc), but we keep it in the tree despite the indirection cost to
>>maintain multiple interfaces etc.
>>
>>
>
>Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT
>MIGHT BREAK OLD USERS.
>
>Your point was exactly what?
>
>Btw, where did that 2TB limit number come from? Afaik, it should be 16TB
>for a 4kB filesystem, no?
>
>
Partition tables describe partitions in units of one sector.
2^(32+9) = 2T
To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
2^(31+12) = 8T
There are _terrible_ hacks to really get to 16T.
-- qiyong
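The limits quoted above are all of the form 2^(index bits + unit shift). A minimal sketch that reproduces them, assuming 512-byte sectors and 4 KiB blocks; the helper name limit_tib is illustrative, and the 48-bit row is added only because it is the subject of this thread:

```python
SECTOR_SHIFT = 9    # 512-byte sectors: 2**9
BLOCK_SHIFT = 12    # 4 KiB blocks:    2**12
TIB = 2**40


def limit_tib(index_bits: int, unit_shift: int) -> int:
    """Largest addressable size in TiB for an index of index_bits bits
    counting units of 2**unit_shift bytes."""
    return 2 ** (index_bits + unit_shift) // TIB


if __name__ == "__main__":
    cases = [
        ("32-bit sector numbers (partition table)", 32, SECTOR_SHIFT),
        ("31-bit block numbers (signed-safe)", 31, BLOCK_SHIFT),
        ("32-bit block numbers (full range)", 32, BLOCK_SHIFT),
        ("48-bit block numbers", 48, BLOCK_SHIFT),
    ]
    for label, bits, shift in cases:
        print(f"{label}: {limit_tib(bits, shift)} TiB")
```

The last case comes out at 2^20 TiB, i.e. 1024 PiB, matching the "1024 PB" figure in the original RFC posting.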
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-20 6:15 ` [Ext2-devel] " Qi Yong
@ 2006-06-20 8:26 ` Laurent Vivier
2006-06-20 8:30 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Laurent Vivier @ 2006-06-20 8:26 UTC (permalink / raw)
To: Qi Yong
Cc: Linus Torvalds, Andrew Morton, Jeff Garzik, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, alex, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 1290 bytes --]
Qi Yong wrote:
> Linus Torvalds wrote:
>
>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>
>>
>>> When is the Linux syscall interface enough? When should we just bump it
>>> and cut out all the compatibility interfaces?
>>>
>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>> support etc), but we keep it in the tree despite the indirection cost to
>>> maintain multiple interfaces etc.
>>>
>>>
>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT
>> MIGHT BREAK OLD USERS.
>>
>> Your point was exactly what?
>>
>> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB
>> for a 4kB filesystem, no?
>>
>>
>
> Partition tables describe partitions in units of one sector.
> 2^(32+9) = 2T
>
> To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
> 2^(31+12) = 8T
>
> There's _terrible_ hacks to really get to 16T.
>
> -- qiyong
>
IMHO, a simple solution is to use "Logical Volume Manager" instead of partition
manager: we create 64bit filesystem in a Logical Volume, not in a partition.
"partitioning is obsolete" ;-)
Regards,
Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-20 8:26 ` Laurent Vivier
@ 2006-06-20 8:30 ` Jeff Garzik
2006-06-20 9:21 ` Laurent Vivier
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-20 8:30 UTC (permalink / raw)
To: Laurent Vivier
Cc: Qi Yong, Linus Torvalds, Andrew Morton, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, alex, Andreas Dilger
Laurent Vivier wrote:
> Qi Yong wrote:
>> Linus Torvalds wrote:
>>
>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>
>>>
>>>> When is the Linux syscall interface enough? When should we just bump it
>>>> and cut out all the compatibility interfaces?
>>>>
>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>> support etc), but we keep it in the tree despite the indirection cost to
>>>> maintain multiple interfaces etc.
>>>>
>>>>
>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN WAYS THAT
>>> MIGHT BREAK OLD USERS.
>>>
>>> Your point was exactly what?
>>>
>>> Btw, where did that 2TB limit number come from? Afaik, it should be 16TB
>>> for a 4kB filesystem, no?
>>>
>>>
>> Partition tables describe partitions in units of one sector.
>> 2^(32+9) = 2T
>>
>> To prevent integer overflow, we should use only 31 bits of a 32-bit integer.
>> 2^(31+12) = 8T
>>
>> There's _terrible_ hacks to really get to 16T.
>>
>> -- qiyong
>>
>
> IMHO, a simple solution is to use "Logical Volume Manager" instead of partition
> manager: we create 64bit filesystem in a Logical Volume, not in a partition.
That doesn't solve anything, if you are not using a 64bit filesystem.
> "partitioning is obsolete" ;-)
LVM is nothing but a partition manager...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-20 8:30 ` Jeff Garzik
@ 2006-06-20 9:21 ` Laurent Vivier
2006-06-20 9:48 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Laurent Vivier @ 2006-06-20 9:21 UTC (permalink / raw)
To: Jeff Garzik
Cc: Qi Yong, Linus Torvalds, Andrew Morton, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Mingming Cao,
linux-fsdevel, alex, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 2386 bytes --]
Jeff Garzik wrote:
> Laurent Vivier wrote:
>> Qi Yong wrote:
>>> Linus Torvalds wrote:
>>>
>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>
>>>>
>>>>> When is the Linux syscall interface enough? When should we just
>>>>> bump it
>>>>> and cut out all the compatibility interfaces?
>>>>>
>>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>>> support etc), but we keep it in the tree despite the indirection
>>>>> cost to
>>>>> maintain multiple interfaces etc.
>>>>>
>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>
>>>> Your point was exactly what?
>>>>
>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>> 16TB for a 4kB filesystem, no?
>>>>
>>>>
>>> Partition tables describe partitions in units of one sector.
>>> 2^(32+9) = 2T
>>>
>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>> integer.
>>> 2^(31+12) = 8T
>>>
>>> There's _terrible_ hacks to really get to 16T.
>>>
>>> -- qiyong
>>>
>>
>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>> partition
>> manager: we create 64bit filesystem in a Logical Volume, not in a
>> partition.
>
> That doesn't solve anything, if you are not using a 64bit filesystem.
Sorry, I don't understand why?
You can use a 32bit filesystem too, but you limit the size of the logical
volume to be compatible with the filesystem you use. LVM allows creating
several 32bit volumes on a big (> 8T) disk (if one exists).
But if we think further, as the biggest disk is 750 GB (and I think even
using HW RAID, there is a HW limit of something like 4 TB), we can imagine a
big Volume Group spanning several Physical Volumes, divided into Logical
Volumes: so we already use LVM, we don't need partitions...
>> "partitioning is obsolete" ;-)
>
> LVM is nothing but a partition manager...
LVM is more than a partition manager:
- it is arch-independent
- it is 64bit compliant
- it can gather together several disks
- it is flexible (you can add/remove/resize volume)
- it is modern (doesn't have primary/extended partitions, doesn't have a limited
number of partitions)
so... it's a volume manager.
Regards,
Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-20 9:21 ` Laurent Vivier
@ 2006-06-20 9:48 ` Jeff Garzik
2006-06-20 10:40 ` Laurent Vivier
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-20 9:48 UTC (permalink / raw)
To: Laurent Vivier
Cc: Andrew Morton, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, alex, Andreas Dilger, Qi Yong
Laurent Vivier wrote:
> Jeff Garzik wrote:
>> Laurent Vivier wrote:
>>> Qi Yong wrote:
>>>> Linus Torvalds wrote:
>>>>
>>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>>
>>>>>
>>>>>> When is the Linux syscall interface enough? When should we just
>>>>>> bump it
>>>>>> and cut out all the compatibility interfaces?
>>>>>>
>>>>>> No, we don't; we let people configure certain obsolete bits out (a.out
>>>>>> support etc), but we keep it in the tree despite the indirection
>>>>>> cost to
>>>>>> maintain multiple interfaces etc.
>>>>>>
>>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>>
>>>>> Your point was exactly what?
>>>>>
>>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>>> 16TB for a 4kB filesystem, no?
>>>>>
>>>>>
>>>> Partition tables describe partitions in units of one sector.
>>>> 2^(32+9) = 2T
>>>>
>>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>>> integer.
>>>> 2^(31+12) = 8T
>>>>
>>>> There's _terrible_ hacks to really get to 16T.
>>>>
>>>> -- qiyong
>>>>
>>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>>> partition
>>> manager: we create 64bit filesystem in a Logical Volume, not in a
>>> partition.
>> That doesn't solve anything, if you are not using a 64bit filesystem.
>
> Sorry, I don't undestand why ???
>
> You can use 32bit filesystem too, but you limit the size of the logical volume
> to be compatible with the filesystem you use. LVM allows to create several 32bit
> volumes on a big (> 8T) disk (if exists)
Let's review the thread:
qiyong: <these limits> exist in the filesystem
you: bust those limits with LVM!
I think you are misunderstanding the subthread.
>>> "partitioning is obsolete" ;-)
>> LVM is nothing but a partition manager...
>
> LVM is more than a partition manager:
I am well aware of what LVM2 and device mapper can do.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-20 9:48 ` Jeff Garzik
@ 2006-06-20 10:40 ` Laurent Vivier
0 siblings, 0 replies; 295+ messages in thread
From: Laurent Vivier @ 2006-06-20 10:40 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Stephen C. Tweedie,
ext2-devel@lists.sourceforge.net, linux-kernel, Linus Torvalds,
Mingming Cao, linux-fsdevel, alex, Andreas Dilger, Qi Yong
[-- Attachment #1.1: Type: text/plain, Size: 2246 bytes --]
Jeff Garzik wrote:
> Laurent Vivier wrote:
>> Jeff Garzik wrote:
>>> Laurent Vivier wrote:
>>>> Qi Yong wrote:
>>>>> Linus Torvalds wrote:
>>>>>
>>>>>> On Fri, 9 Jun 2006, Stephen C. Tweedie wrote:
>>>>>>
>>>>>>
>>>>>>> When is the Linux syscall interface enough? When should we just
>>>>>>> bump it
>>>>>>> and cut out all the compatibility interfaces?
>>>>>>>
>>>>>>> No, we don't; we let people configure certain obsolete bits out
>>>>>>> (a.out
>>>>>>> support etc), but we keep it in the tree despite the indirection
>>>>>>> cost to
>>>>>>> maintain multiple interfaces etc.
>>>>>>>
>>>>>> Right. WE ADD NEW SYSTEM CALLS. WE DO NOT EXTEND THE OLD ONES IN
>>>>>> WAYS THAT MIGHT BREAK OLD USERS.
>>>>>>
>>>>>> Your point was exactly what?
>>>>>>
>>>>>> Btw, where did that 2TB limit number come from? Afaik, it should be
>>>>>> 16TB for a 4kB filesystem, no?
>>>>>>
>>>>>>
>>>>> Partition tables describe partitions in units of one sector.
>>>>> 2^(32+9) = 2T
>>>>>
>>>>> To prevent integer overflow, we should use only 31 bits of a 32-bit
>>>>> integer.
>>>>> 2^(31+12) = 8T
>>>>>
>>>>> There's _terrible_ hacks to really get to 16T.
>>>>>
>>>>> -- qiyong
>>>>>
>>>> IMHO, a simple solution is to use "Logical Volume Manager" instead of
>>>> partition
>>>> manager: we create 64bit filesystem in a Logical Volume, not in a
>>>> partition.
>>> That doesn't solve anything, if you are not using a 64bit filesystem.
>>
>> Sorry, I don't undestand why ???
>>
>> You can use 32bit filesystem too, but you limit the size of the
>> logical volume
>> to be compatible with the filesystem you use. LVM allows to create
>> several 32bit
>> volumes on a big (> 8T) disk (if exists)
>
> Let's review the thread:
>
> qiyong: <these limits> exist in the filesystem
> you: bust those limits with LVM!
>
> I think you are misunderstanding the subthread.
>
Yes...
I understood:
qiyong: <these limits> exist in the partition manager.
(because with patches proposed on http://www.bullopensource.org/ext4 these
limits don't exist anymore in the filesystem.)
Regards,
Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4
[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 2:49 ` Jeff Garzik
2006-06-09 8:35 ` Andreas Dilger
@ 2006-06-09 17:14 ` Alan Cox
1 sibling, 0 replies; 295+ messages in thread
From: Alan Cox @ 2006-06-09 17:14 UTC (permalink / raw)
To: Jeff Garzik
Cc: cmm, Andrew Morton, Linus Torvalds, linux-kernel, ext2-devel,
linux-fsdevel
On Thu, 2006-06-08 at 22:49 -0400, Jeff Garzik wrote:
> People (including me) still switch back and forth between ext2 and ext3
> mounts of the same filesystem on occasion. I think creating an "ext4"
> would allow for greater developer flexibility in implementing new
> features and ditching old ones -- while also emphasizing to the user
> that switching back and forth between ext4 and ext[23] is a bad idea.
I would agree with this, particularly as ext3 and ext4 are quite small
in the kernel side of things and people needing 48bit extents are
probably not trying to run on 8MB of flash.
Alan
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
2006-06-09 2:40 ` Valdis.Kletnieks
2006-06-09 2:49 ` Jeff Garzik
@ 2006-06-09 9:13 ` Christoph Hellwig
2006-06-09 10:07 ` Andrew Morton
` (2 more replies)
2006-06-30 0:16 ` [RFC][Update 0/16]extents and 48bit ext3/4 patches Mingming Cao
` (16 subsequent siblings)
19 siblings, 3 replies; 295+ messages in thread
From: Christoph Hellwig @ 2006-06-09 9:13 UTC (permalink / raw)
To: Mingming Cao; +Cc: linux-kernel, ext2-devel, linux-fsdevel
On Thu, Jun 08, 2006 at 06:20:54PM -0700, Mingming Cao wrote:
> Current ext3 filesystem is limited to 8TB(4k block size), this is
> practically not enough for the increasing need of bigger storage as
> disks in a few years (or even now).
>
> To address this need, there are co-effort from RedHat, ClusterFS, IBM
> and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
> expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> ext3 is build on top of extent map changes for ext3, originally from
> Alex Tomas. In short, the new ext3 on-disk extents format is:
What a horrible idea! The nice things about ext3 are:
- the rather simple and thus reliable implementation
- the lack of incompatible ondisk changes
and the block numbers aren't the big problem concerning scalability, there's
a lot more to it, like btree(-like) structures in the allocator, parallel
allocator algorithms and a better allocation group concept.
If you guys want big storage on linux please help improving the filesystems
designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
both making ext3 less reliable for us desktop/small server users and not getting
the full thing for the big storage people either.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 9:13 ` Christoph Hellwig
@ 2006-06-09 10:07 ` Andrew Morton
2006-06-09 15:40 ` Jeff Garzik
2006-06-09 10:49 ` Andreas Dilger
2006-06-09 11:26 ` Alex Tomas
2 siblings, 1 reply; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 10:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, ext2-devel, cmm, linux-kernel
On Fri, 9 Jun 2006 10:13:27 +0100
Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Jun 08, 2006 at 06:20:54PM -0700, Mingming Cao wrote:
> > Current ext3 filesystem is limited to 8TB(4k block size), this is
> > practically not enough for the increasing need of bigger storage as
> > disks in a few years (or even now).
> >
> > To address this need, there are co-effort from RedHat, ClusterFS, IBM
> > and BULL to move ext3 from 32 bit filesystem to 48 bit filesystem,
> > expanding ext3 filesystem limit from 8TB today to 1024 PB. The 48 bit
> > ext3 is build on top of extent map changes for ext3, originally from
> > Alex Tomas. In short, the new ext3 on-disk extents format is:
>
> What a horrible idea! The nice things about ext3 are:
>
> - the rather simple and thus reliable implementation
JBD isn't simple. I don't think there's a need in this project to make
algorithmic changes in either JBD or htree, thankfully.
> - the lack of incompatible ondisk changes
Ted&co have been pretty good at avoiding compatibility problems.
> and the block numbers aren't the big problem concerning scalability, there's
> a lot more to it, like btree(-like) structures in the allocator, parallel
> allocator algorithms and a better allocation group concept.
The performance testing results I've seen for a few of the components of
this project have been rather good, and that's the bottom line.
I don't know how the end result would compare in a bakeoff against XFS, and
I doubt if we know how much XFS performance would be improved if this
effort were diverted into that project.
But I don't think it's all as clear-cut as you imply.
> If you guys want big storage on Linux, please help improve the filesystems
> designed for that, e.g. jfs or xfs, instead of shoehorning it onto ext3, thus
> both making ext3 less reliable for us desktop/small server users and not getting
> the full thing for the big storage people either.
There have been pretty big changes in ext3 post-2.6.early and we've been OK
at avoiding breakage thus far. It all comes down to how well the new
codepaths manage to avoid altering the existing ones.
That being said, ext3 isn't exactly .... modern. One day we'll need
something better.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 10:07 ` Andrew Morton
@ 2006-06-09 15:40 ` Jeff Garzik
2006-06-09 15:42 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Hellwig, cmm, linux-kernel, ext2-devel, linux-fsdevel
Andrew Morton wrote:
> Ted&co have been pretty good at avoiding compatibility problems.
Well, extents and 48bit make that track record demonstrably worse.
Users are now forced to remember that, if they write to their filesystem
after using either $mmver or $korgver kernels, they are locked out of
using older kernels.
From the user's perspective, ext3 has no clear "metadata version 1",
"metadata version 2" division. Thus they are now forced to keep a
matrix of kernel versions and ext3 feature flag support, to know which
kernels are usable with which data. It is a support nightmare.
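To make the "matrix" concrete: ext2/ext3 record features as bitmasks in
the superblock, split into compat, ro_compat and incompat sets, and a
kernel refuses to mount a filesystem carrying an incompat bit it does not
know. A minimal sketch of that rule follows. The flag names and bit values
are chosen purely for illustration (they are assumptions, not taken from
the patches), and mount_decision is a hypothetical helper.

```python
# Sketch of the ext2/ext3-style feature check. Names and values below are
# illustrative assumptions, not copied from any kernel header.
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_EXTENTS = 0x0040   # hypothetical bit for the extents format
INCOMPAT_64BIT = 0x0080     # hypothetical bit for 48-bit block numbers


def mount_decision(fs_incompat, fs_ro_compat, known_incompat, known_ro_compat):
    """Return 'refuse', 'read-only' or 'read-write' for a given kernel."""
    if fs_incompat & ~known_incompat:
        return "refuse"        # unknown incompat feature: cannot mount at all
    if fs_ro_compat & ~known_ro_compat:
        return "read-only"     # unknown ro_compat feature: no writes allowed
    return "read-write"


# Feature sets understood by a hypothetical old and new kernel.
old_kernel = INCOMPAT_FILETYPE
new_kernel = INCOMPAT_FILETYPE | INCOMPAT_EXTENTS | INCOMPAT_64BIT

# A filesystem that has had an extent-mapped file written to it:
fs = INCOMPAT_FILETYPE | INCOMPAT_EXTENTS

print(mount_decision(fs, 0, new_kernel, 0))   # read-write
print(mount_decision(fs, 0, old_kernel, 0))   # refuse
```

Under these assumptions, the first write that sets the new incompat bit is
the point after which the older kernel can no longer mount the filesystem.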
At no point is a user ever told, in big capital letters, "IF YOU WRITE
TO THIS FILESYSTEM, YOU CAN'T BOOT OLDER KERNELS." There is no "click
OK to continue with this dramatic event."
And as features continue to be added in this manner, this problem gets
_exponentially_ worse.
On the project management side of things, I see no indication that this
momentum will slow -- which implies to me that people will keep slapping new
stuff into ext3, rather than directing energy towards a newer, cleaner
ext-NG filesystem.
Dragging around back-compat really constrains freedom, and you have to
have some sort of "pressure relief valve" (a massive, wildly
incompatible update) eventually.
In my mind, it's analogous to locking developers into developing and
deploying new features into a stable branch of software. The hacks just
get worse and worse, as you bend over backwards for back-compat.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Jeff Garzik
@ 2006-06-09 15:42 ` Matthew Wilcox
2006-06-09 15:51 ` Jeff Garzik
2006-06-09 17:29 ` Alan Cox
2006-06-09 16:56 ` Andrew Morton
2006-06-09 18:23 ` Michael Poole
2 siblings, 2 replies; 295+ messages in thread
From: Matthew Wilcox @ 2006-06-09 15:42 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Christoph Hellwig, cmm,
linux-fsdevel
On Fri, Jun 09, 2006 at 11:40:03AM -0400, Jeff Garzik wrote:
> Users are now forced to remember that, if they write to their filesystem
> after using either $mmver or $korgver kernels, they are locked out of
> using older kernels.
>
> From the user's perspective, ext3 has no clear "metadata version 1",
> "metadata version 2" division. Thus they are now forced to keep a
> matrix of kernel versions and ext3 feature flag support, to know which
> kernels are usable with which data. It is a support nightmare.
Hang on, you're going too far. You have to enable extents with the
extent mount option. Otherwise you don't get to use them. The user
does, in fact, have a clear division, although maybe the blinky signs
aren't quite luminous enough.
I still think making ext3 bigger than 16TB is just silly.
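For reference, the size figures quoted in this thread all come from the
same arithmetic: number of addressable blocks times block size. A rough
sketch, assuming 4 KiB blocks throughout and using a hypothetical helper
name (max_fs_bytes is not from any patch):

```python
# Illustrative arithmetic only; max_fs_bytes is a made-up helper name.
BLOCK_SIZE = 4096  # 4 KiB blocks, as in the original announcement

TIB = 1 << 40
PIB = 1 << 50


def max_fs_bytes(block_bits, block_size=BLOCK_SIZE):
    """Largest filesystem addressable with block_bits-bit block numbers."""
    return (1 << block_bits) * block_size


print(max_fs_bytes(31) // TIB, "TiB")   # 8 TiB
print(max_fs_bytes(32) // TIB, "TiB")   # 16 TiB
print(max_fs_bytes(48) // PIB, "PiB")   # 1024 PiB
```

So the 8TB figure corresponds to 2^31 blocks, the 16TB figure to a full
2^32 blocks, and 1024 PB to 2^48 blocks, all at a 4 KiB block size.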
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:42 ` Matthew Wilcox
@ 2006-06-09 15:51 ` Jeff Garzik
2006-06-09 17:29 ` Alan Cox
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:51 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
linux-fsdevel
Matthew Wilcox wrote:
> On Fri, Jun 09, 2006 at 11:40:03AM -0400, Jeff Garzik wrote:
>> Users are now forced to remember that, if they write to their filesystem
>> after using either $mmver or $korgver kernels, they are locked out of
>> using older kernels.
>>
>> From the user's perspective, ext3 has no clear "metadata version 1",
>> "metadata version 2" division. Thus they are now forced to keep a
>> matrix of kernel versions and ext3 feature flag support, to know which
>> kernels are usable with which data. It is a support nightmare.
>
> Hang on, you're going too far. You have to enable extents with the
> extent mount option. Otherwise you don't get to use them. The user
> does, in fact, have a clear division, although maybe the blinky signs
> aren't quite luminous enough.
...and how are distros going to deploy this? They are going to turn on
extents by default.
And do we honestly think that is a scalable option _anyway_? That will
slowly bloat fstab and mount command lines with an ever-increasing list
of options.
It's IMO a better experience for the user, and gives the developers more
freedom.
Look, I _really_ want extents. I am a big fan. But I think
that extents are a good time to make a clean break, and let ext3 live as
it is. And it will let ext3 stabilize.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:42 ` Matthew Wilcox
2006-06-09 15:51 ` Jeff Garzik
@ 2006-06-09 17:29 ` Alan Cox
1 sibling, 0 replies; 295+ messages in thread
From: Alan Cox @ 2006-06-09 17:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Christoph Hellwig, cmm, linux-fsdevel
Ar Gwe, 2006-06-09 am 09:42 -0600, ysgrifennodd Matthew Wilcox:
> Hang on, you're going too far. You have to enable extents with the
> extent mount option. Otherwise you don't get to use them. The user
> does, in fact, have a clear division, although maybe the blinky signs
> aren't quite luminous enough.
<mba marketing>
I'd rather the blinky sign was "ext4". It makes it clear it is a
progression and it also gives everyone something to put in the features
box and talk to the press about 8)
</mba>
> I still think making ext3 bigger than 16TB is just silly.
We recently fixed a 'If the disk is 4TB in size the geometry reporting
breaks and parted crashes' bug. The stuff is out there and people want
to run ext3 on it or an ext3 derivative they feel they trust. Does it
matter whether it is the most optimal solution? That'll sort itself out
as ext3.5/ext4, reiser4, jfs, xfs etc get picked and demanded by users.
Alan
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Jeff Garzik
2006-06-09 15:42 ` Matthew Wilcox
@ 2006-06-09 16:56 ` Andrew Morton
2006-06-09 17:07 ` Jeff Garzik
2006-06-09 18:23 ` Michael Poole
2 siblings, 1 reply; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 16:56 UTC (permalink / raw)
To: Jeff Garzik; +Cc: hch, cmm, linux-kernel, ext2-devel, linux-fsdevel
On Fri, 09 Jun 2006 11:40:03 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> Users are now forced to remember that, if they write to their filesystem
> after using either $mmver or $korgver kernels, they are locked out of
> using older kernels.
The same happens if we create ext4 - earlier kernels don't support that,
either.
I suppose we could call it ext4, although that wouldn't make much
difference operationally. The developers would probably choose to generate
ext4 from the same codebase as ext3 for maintainability reasons, rather
than choosing to copy-n-modify. We'd need to see the patches to be able to
finally make that judgement.
>
> And as features continue to be added in this manner, this problem gets
> _exponentially_ worse.
"continue to be added"? afaik this is the first time this has happened,
and there's no plan to do it again.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:56 ` Andrew Morton
@ 2006-06-09 17:07 ` Jeff Garzik
2006-06-09 17:35 ` Andrew Morton
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: hch, cmm, linux-kernel, ext2-devel, linux-fsdevel
Andrew Morton wrote:
> On Fri, 09 Jun 2006 11:40:03 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> Users are now forced to remember that, if they write to their filesystem
>> after using either $mmver or $korgver kernels, they are locked out of
>> using older kernels.
>
> The same happens if we create ext4 - earlier kernels don't support that,
> either.
>
> I suppose we could call it ext4, although that wouldn't make much
> difference operationally. The developers would probably choose to generate
> ext4 from the same codebase as ext3 for maintainability reasons, rather
> than choosing to copy-n-modify. We'd need to see the patches to be able to
> finally make that judgement.
I would propose the obvious... 'cp -a ext3 ext4', apply the extent and
48bit patches, and then do the obvious search-n-replace.
I guarantee that developer momentum would take over from there. Rather
than fundamentally change ext3, let's let it stabilize.
>> And as features continue to be added in this manner, this problem gets
>> _exponentially_ worse.
>
> "continue to be added"? afaik this is the first time this has happened,
> and there's no plan to do it again.
ext3 developers are _fundamentally changing_ the block allocation
structure [in a good way]. If they can get away with it once, they will
continue to modify ext3, adding btrees and other new gadgets. That's
just human nature. For example, htree was a minor disaster,
deployment-wise, on the distro vendor side.
I think extents and 48bit are so fundamental that it's silly to attempt
to minimize the impact from the user's perspective, and moreover, I
think Linux benefits more if ext3 is _not_ kept on life support this way.
We need to draw a line in the sand. If we don't, no one ever will.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:07 ` Jeff Garzik
@ 2006-06-09 17:35 ` Andrew Morton
2006-06-09 17:48 ` Jeff Garzik
2006-06-09 21:42 ` Sonny Rao
0 siblings, 2 replies; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 17:35 UTC (permalink / raw)
To: Jeff Garzik; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel
On Fri, 09 Jun 2006 13:07:37 -0400
Jeff Garzik <jeff@garzik.org> wrote:
> I would propose the obvious... 'cp -a ext3 ext4', apply the extent and
> 48bit patches, and then do the obvious search-n-replace.
Most of ext3 is JBD. At least, in terms of complexity. And I don't think
there's anything in this proposal which affects JBD, apart from changing
the blocksize.
Cloning JBD for this exercise would, I suspect, be the wrong thing to do -
the two clones would be pretty much identical, apart from some scalar
types.
I did suggest a couple of years ago that we should clone the ext3 part and
have both ext3 and ext4 use the same JBD layer - I don't know what happened
to that idea.
There has been steady, cautious but significant improvement happening in
ext3 over the past few years. I'd expect that to continue, although
perhaps at a lower rate. Having to apply the same changes to two
filesystems would be an obvious loss.
It comes down to looking at the patches, and I haven't done that in quite
some time. Ideally the new functionality would all be under CONFIG_foo,
but I do not know if that is being proposed here?
> We need to draw a line in the sand. If we don't, no one ever will.
You speak as if this is something which has happened before, or that it will
happen again.
All that being said, Linux's filesystems are looking increasingly crufty
and we are getting to the time where we would benefit from a greenfield
start-a-new-one. That new one might even be based on reiser4 - has anyone
looked? It's been sitting around for a couple of years.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:35 ` Andrew Morton
@ 2006-06-09 17:48 ` Jeff Garzik
2006-06-09 17:59 ` Jeff Garzik
2006-06-09 21:42 ` Sonny Rao
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel
Andrew Morton wrote:
> On Fri, 09 Jun 2006 13:07:37 -0400
> Jeff Garzik <jeff@garzik.org> wrote:
>
>> I would propose the obvious... 'cp -a ext3 ext4', apply the extent and
>> 48bit patches, and then do the obvious search-n-replace.
>
> Most of ext3 is JBD. At least, in terms of complexity. And I don't think
> there's anything in this proposal which affects JBD, apart from changing
> the blocksize.
>
> Cloning JBD for this exercise would, I suspect, be the wrong thing to do -
> the two clones would be pretty much identical, apart from some scalar
> types.
>
> I did suggest a couple of years ago that we should clone the ext3 part and
> have both ext3 and ext4 use the same JBD layer - I don't know what happened
> to that idea.
The JBD API is reasonably distinct, so IMO this would be a logical next
step. I would hope they could use the same JBD, so, I strongly agree...
> There has been steady, cautious but significant improvement happening in
> ext3 over the past few years. I'd expect that to continue, although
> perhaps at a lower rate. Having to apply the same changes to two
> filesystems would be an obvious loss.
I disagree completely... it would be an obvious win: people who want
stability get that, people who want new features get that too.
> It comes down to looking at the patches, and I haven't done that in quite
> some time. Ideally the new functionality would all be under CONFIG_foo,
> but I do not know if that is being proposed here?
>
>> We need to draw a line in the sand. If we don't, no one ever will.
>
> You speak as if this is something which has happened before, or that it will
> happen again.
>
> All that being said, Linux's filesystems are looking increasingly crufty
> and we are getting to the time where we would benefit from a greenfield
> start-a-new-one. That new one might even be based on reiser4 - has anyone
> looked? It's been sitting around for a couple of years.
reiser4 actually has this same problem, but worse. It has pluggable
metadata even to the point of supporting plugin-style metadata development.
If we can successfully devolve a filesystem to metadata and algorithm
plugins, that should be done at the VFS level, and not called "reiser4".
But in the absence of a different VFS API, I think it is the most
practical of all the options to open the floodgates to ext4 rather than
ext3.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:48 ` Jeff Garzik
@ 2006-06-09 17:59 ` Jeff Garzik
2006-06-09 18:27 ` [Ext2-devel] " Mike Snitzer
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 17:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: hch, linux-fsdevel, ext2-devel, cmm, linux-kernel
Jeff Garzik wrote:
> I disagree completely... it would be an obvious win: people who want
> stability get that, people who want new features get that too.
And developers have a better outlet for their wacky developmental urges...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 17:59 ` Jeff Garzik
@ 2006-06-09 18:27 ` Mike Snitzer
2006-06-09 18:54 ` Jeff Garzik
2006-06-10 13:49 ` Adrian Bunk
0 siblings, 2 replies; 295+ messages in thread
From: Mike Snitzer @ 2006-06-09 18:27 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm, linux-kernel
On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> Jeff Garzik wrote:
> > I disagree completely... it would be an obvious win: people who want
> > stability get that, people who want new features get that too.
>
> And developers have a better outlet for their wacky developmental urges...
And no real-world near-term progress is made for production users with
modern requirements. What you're advocating breeds instability in the
near-term.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:27 ` [Ext2-devel] " Mike Snitzer
@ 2006-06-09 18:54 ` Jeff Garzik
2006-06-09 19:22 ` Alex Tomas
2006-06-10 13:49 ` Adrian Bunk
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:54 UTC (permalink / raw)
To: Mike Snitzer
Cc: Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm, linux-kernel
Mike Snitzer wrote:
> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
>> Jeff Garzik wrote:
>> > I disagree completely... it would be an obvious win: people who want
>> > stability get that, people who want new features get that too.
>>
>> And developers have a better outlet for their wacky developmental
>> urges...
>
> And no real-world near-term progress is made for production users with
> modern requirements. What you're advocating breeds instability in the
> near-term.
Constantly patching the main, "stable" Linux filesystem breeds
instability today.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:54 ` Jeff Garzik
@ 2006-06-09 19:22 ` Alex Tomas
2006-06-09 19:23 ` Jeff Garzik
2006-06-09 22:49 ` Valdis.Kletnieks
0 siblings, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 19:22 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel
what if proposed patch is safer than an average fix?
(given that it's just out of usage unless enabled)
thanks, Alex
>>>>> Jeff Garzik (JG) writes:
JG> Mike Snitzer wrote:
>> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
>>> Jeff Garzik wrote:
>>> > I disagree completely... it would be an obvious win: people who want
>>> > stability get that, people who want new features get that too.
>>>
>>> And developers have a better outlet for their wacky developmental
>>> urges...
>>
>> And no real-world near-term progress is made for production users with
>> modern requirements. What you're advocating breeds instability in the
>> near-term.
JG> Constantly patching the main, "stable" Linux filesystem breeds
JG> instability today.
JG> Jeff
JG> _______________________________________________
JG> Ext2-devel mailing list
JG> Ext2-devel@lists.sourceforge.net
JG> https://lists.sourceforge.net/lists/listinfo/ext2-devel
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:22 ` Alex Tomas
@ 2006-06-09 19:23 ` Jeff Garzik
2006-06-09 22:49 ` Valdis.Kletnieks
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 19:23 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel
Alex Tomas wrote:
> what if proposed patch is safer than an average fix?
> (given that it's just out of usage unless enabled)
Regardless of how you phrase it, it is an inescapable fact that you are
developing new stuff in the main, supposedly-stable Linux filesystem.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 19:22 ` Alex Tomas
2006-06-09 19:23 ` Jeff Garzik
@ 2006-06-09 22:49 ` Valdis.Kletnieks
2006-06-09 23:34 ` [Ext2-devel] " Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09 22:49 UTC (permalink / raw)
To: Alex Tomas
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel, hch, cmm,
linux-fsdevel
[-- Attachment #1.1: Type: text/plain, Size: 337 bytes --]
On Fri, 09 Jun 2006 23:22:23 +0400, Alex Tomas said:
> what if proposed patch is safer than an average fix?
> (given that it's just out of usage unless enabled)
Those are the *dangerous* patches, because they usually contain bugs
that weren't tripped over by the 6 people who enabled it while it
was bouncing around in the -mm tree....
[-- Attachment #1.2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 22:49 ` Valdis.Kletnieks
@ 2006-06-09 23:34 ` Andreas Dilger
0 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:34 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Alex Tomas, Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
hch, cmm, linux-fsdevel
On Jun 09, 2006 18:49 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 23:22:23 +0400, Alex Tomas said:
> > what if proposed patch is safer than an average fix?
> > (given that it's just out of usage unless enabled)
>
> Those are the *dangerous* patches, because they usually contain bugs
> that weren't tripped over by the 6 people who enabled it while it
> was bouncing around in the -mm tree....
Umm, in case you didn't know, the extent patch which is the primary issue
of discussion here (not the whole 64-bit clean changes though) was run
for MILLIONS of hours under very high IO load on the largest computer
systems in the world for the last year or so. It is easy to get millions
of hours of usage if there are thousands of servers running this code...
Yes, I have no doubt there will be bugs in the code because the usage
pattern is different for different environments, but we aren't advocating
the inclusion of something major like this that was just written
yesterday in someone's basement.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:27 ` [Ext2-devel] " Mike Snitzer
2006-06-09 18:54 ` Jeff Garzik
@ 2006-06-10 13:49 ` Adrian Bunk
2006-06-10 13:51 ` Christoph Hellwig
1 sibling, 1 reply; 295+ messages in thread
From: Adrian Bunk @ 2006-06-10 13:49 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Garzik, Andrew Morton, hch, linux-fsdevel, ext2-devel, cmm,
linux-kernel
On Fri, Jun 09, 2006 at 02:27:53PM -0400, Mike Snitzer wrote:
> On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> >Jeff Garzik wrote:
> >> I disagree completely... it would be an obvious win: people who want
> >> stability get that, people who want new features get that too.
> >
> >And developers have a better outlet for their wacky developmental urges...
>
> And no real-world near-term progress is made for production users with
> modern requirements. What you're advocating breeds instability in the
> near-term.
There's also the old-fashioned "no regressions" requirement.
You are trading near-term instability for the few users with "modern
requirements" against possible regressions for a large userbase.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 13:49 ` Adrian Bunk
@ 2006-06-10 13:51 ` Christoph Hellwig
2006-06-10 14:54 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Christoph Hellwig @ 2006-06-10 13:51 UTC (permalink / raw)
To: Adrian Bunk
Cc: Mike Snitzer, Jeff Garzik, Andrew Morton, hch, linux-fsdevel,
ext2-devel, cmm, linux-kernel
On Sat, Jun 10, 2006 at 03:49:46PM +0200, Adrian Bunk wrote:
> > And no real-world near-term progress is made for production users with
> > modern requirements. What you're advocating breeds instability in the
> > near-term.
>
> There's also the old-fashioned "no regressions" requirement.
>
> You are trading near-term instability for the few users with "modern
> requirements" against possible regressions for a large userbase.
Alex mentioned a few times that the extents code just adds three ifs.
I'm pretty sure that will not give you any regressions in the existing
codebase. Can we concentrate on the more useful discussion topics now?
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 13:51 ` Christoph Hellwig
@ 2006-06-10 14:54 ` Jeff Garzik
2006-06-10 18:01 ` [Ext2-devel] " Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-10 14:54 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton
Cc: ext2-devel, linux-kernel, cmm, linux-fsdevel, Adrian Bunk
Christoph Hellwig wrote:
> On Sat, Jun 10, 2006 at 03:49:46PM +0200, Adrian Bunk wrote:
>>> And no real-world near-term progress is made for production users with
>>> modern requirements. What you're advocating breeds instability in the
>>> near-term.
>> There's also the old-fashioned "no regressions" requirement.
>>
>> You are trading near-term instability for the few users with "modern
>> requirements" against possible regressions for a large userbase.
>
> Alex mentioned a few times that the extents code just adds three ifs.
> I'm pretty sure that will not give you any regressions in the existing
> codebase. Can we concentrate on the more useful discussion topics now?
Alex is off by an order of magnitude. I've re-read the 13-patch series,
and this is the result of the review:
There are _five_ "if (new) .. else .." constructs added in JBD alone.
Three added in extent map support.
Twenty-seven (27) such constructs in 48-bit physical block support.
Two more in 48-bit ACL support.
And finally, the superblock changes don't add any branches, like the
other code does, but it does double the endian conversion work that
-every- user must do, even if they don't use 48bit at all.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 14:54 ` Jeff Garzik
@ 2006-06-10 18:01 ` Andreas Dilger
0 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 18:01 UTC (permalink / raw)
To: Jeff Garzik
Cc: Christoph Hellwig, Andrew Morton, Adrian Bunk, Mike Snitzer,
linux-fsdevel, ext2-devel, cmm, linux-kernel
On Jun 10, 2006 10:54 -0400, Jeff Garzik wrote:
> Christoph Hellwig wrote:
> >Alex mentioned a few times that the extents code just adds three ifs.
> >I'm pretty sure that will not give you any regressions in the existing
> >codebase. Can we concentrate on the more useful discussion topics now?
>
> Alex is off by an order of magnitude. I've re-read the 13-patch series,
> and this is the result of the review:
Thanks for at least looking at the code, which was the intention of posting
the patches... It caused quite a few more ruffled feathers than we expected.
> Three added in extent map support.
As Christoph quoted Alex, "the extents code", which you confirm is 3 "ifs".
> There are _five_ "if (new) .. else .." constructs added in JBD alone.
Actually, 64-bit support in the JBD code was written by Zach Brown
for OCFS, so I think they want this patch into the kernel regardless.
It's a relatively simple change though - all conditional on a single flag.
> Twenty-seven (27) such constructs in 48-bit physical block support.
Though there are really only 2 conditionals (in macros, one for read and
one for write) that are used everywhere, so it's not as bad as it seems.
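As a rough illustration of how two helpers can carry all of those call
sites: a 48-bit block number can be stored as a 32-bit low half plus a
16-bit high half, with one flag deciding whether the high half is used.
The sketch below uses hypothetical names and a made-up 6-byte
little-endian layout; it is not the code from the patch series.

```python
import struct

# Hypothetical on-disk layout: 32-bit low half, then 16-bit high half,
# both little-endian. Names are illustrative, not from the patches.
_FMT = "<IH"


def write_block(blocknr, has_64bit):
    """Pack a block number; the high half is stored only if has_64bit."""
    limit_bits = 48 if has_64bit else 32
    if blocknr < 0 or blocknr >> limit_bits:
        raise ValueError("block number does not fit in %d bits" % limit_bits)
    hi = (blocknr >> 32) if has_64bit else 0
    return struct.pack(_FMT, blocknr & 0xFFFFFFFF, hi)


def read_block(raw, has_64bit):
    """Unpack a block number; the high half is ignored unless has_64bit."""
    lo, hi = struct.unpack(_FMT, raw)
    return ((hi << 32) | lo) if has_64bit else lo


big = (1 << 47) + 12345
assert read_block(write_block(big, True), True) == big
assert read_block(write_block(42, False), False) == 42
```

Every caller goes through the same pair of functions, so the single
conditional in each one is the only place the old and new formats diverge
in this sketch.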
> Two more in 48-bit ACL support.
>
> And finally, the superblock changes don't add any branches, like the
> other code does, but it does double the endian conversion work that
> -every- user must do, even if they don't use 48bit at all.
These are all related to 48-bit filesystem support, not strictly
extents. Much of the 48-bit code is dependent upon CONFIG_LBD or
sizeof(ext3_fsblk_t), so if people have no desire to use large (2TB+) or
larger (16TB+) filesystems these conditionals disappear at compile time.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 17:35 ` Andrew Morton
2006-06-09 17:48 ` Jeff Garzik
@ 2006-06-09 21:42 ` Sonny Rao
2006-06-09 22:15 ` Andrew Morton
1 sibling, 1 reply; 295+ messages in thread
From: Sonny Rao @ 2006-06-09 21:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jeff Garzik, hch, cmm, linux-kernel, ext2-devel, linux-fsdevel
On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
<snip>
> All that being said, Linux's filesystems are looking increasingly crufty
> and we are getting to the time where we would benefit from a greenfield
> start-a-new-one.
I'm curious about this comment; in what way are they _collectively_
looking crufty ?
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:42 ` Sonny Rao
@ 2006-06-09 22:15 ` Andrew Morton
2006-06-09 23:11 ` Andreas Dilger
2006-06-10 3:49 ` Nathan Scott
0 siblings, 2 replies; 295+ messages in thread
From: Andrew Morton @ 2006-06-09 22:15 UTC (permalink / raw)
To: Sonny Rao; +Cc: jeff, hch, cmm, linux-kernel, ext2-devel, linux-fsdevel
Sonny Rao <sonny@burdell.org> wrote:
>
> On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
> <snip>
> > All that being said, Linux's filesystems are looking increasingly crufty
> > and we are getting to the time where we would benefit from a greenfield
> > start-a-new-one.
>
> I'm curious about this comment; in what way are they _collectively_
> looking crufty ?
We seem to be lagging behind "the industry" in some areas - handling large
devices, high bandwidth IO, sophisticated on-disk data structures, advanced
manageability, etc.
I mean, although ZFS is a rampant layering violation and we can do a lot of
the things in there (without doing it all in the fs!) I don't think we can
do all of it.
We're continuing to nurse along a few basically-15-year-old filesystems
while we do have the brains, manpower and processes to implement a new,
really great one.
It's just this feeling I have ;)
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 22:15 ` Andrew Morton
@ 2006-06-09 23:11 ` Andreas Dilger
2006-06-09 23:15 ` Jeff Garzik
2006-06-10 3:37 ` Valerie Henson
2006-06-10 3:49 ` Nathan Scott
1 sibling, 2 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:11 UTC (permalink / raw)
To: Andrew Morton
Cc: jeff, ext2-devel, linux-kernel, hch, cmm, linux-fsdevel,
Sonny Rao
On Jun 09, 2006 15:15 -0700, Andrew Morton wrote:
> We seem to be lagging behind "the industry" in some areas - handling large
> devices, high bandwidth IO, sophisticated on-disk data structures, advanced
> manageability, etc.
>
> I mean, although ZFS is a rampant layering violation and we can do a lot of
> the things in there (without doing it all in the fs!) I don't think we can
> do all of it.
>
> We're continuing to nurse along a few basically-15-year-old filesystems
> while we do have the brains, manpower and processes to implement a new,
> really great one.
>
> It's just this feeling I have ;)
I think many people share this feeling (me included), hence the linux
filesystem meeting next week... The problem is that even getting a
half-decent disk filesystem is many years of work, and large disks are
here before then. The ZFS code took 10 years to get to its current state,
I understand, so I don't anticipate we will get there overnight.
The question is whether we can get to this state more easily by starting
on a known-good base (ext3) or by starting from scratch. My opinion is
strongly in the "start from a known-good base" camp, and make incremental
improvements to that base instead of discarding everything and starting
again.
I think the real frontier for future filesystem development is in the
ZFS direction where the filesystem can be robust in the face of data
errors without having a single fail-stop mode of error handling. While
ext2 and ext3 have been OK in this regard they can definitely be improved
without discarding the rest of the code and the millions of hours of
testing that has gone into it.
I'm not so strongly against ext4 that I won't follow that route if needed,
but it essentially means that ext3 will be orphaned.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 23:11 ` Andreas Dilger
@ 2006-06-09 23:15 ` Jeff Garzik
2006-06-10 3:37 ` Valerie Henson
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 23:15 UTC (permalink / raw)
To: Andrew Morton, Sonny Rao, jeff, hch, cmm, linux-kernel,
ext2-devel, linux-fsdevel
Andreas Dilger wrote:
> I'm not so strongly against ext4 that I won't follow that route if needed,
> but it essentially means that ext3 will be orphaned.
Not orphaned but scaled back over time. IMO there's only so much
developer and brain and test bandwidth for "the main Linux filesystem."
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 23:11 ` Andreas Dilger
2006-06-09 23:15 ` Jeff Garzik
@ 2006-06-10 3:37 ` Valerie Henson
1 sibling, 0 replies; 295+ messages in thread
From: Valerie Henson @ 2006-06-10 3:37 UTC (permalink / raw)
To: Andrew Morton, Sonny Rao, jeff, hch, cmm, linux-kernel,
ext2-devel, linux-fsdevel
On Fri, Jun 09, 2006 at 05:11:52PM -0600, Andreas Dilger wrote:
> On Jun 09, 2006 15:15 -0700, Andrew Morton wrote:
> >
> > We're continuing to nurse along a few basically-15-year-old filesystems
> > while we do have the brains, manpower and processes to implement a new,
> > really great one.
> >
> > It's just this feeling I have ;)
>
> I think many people share this feeling (me included), hence the linux
> filesystem meeting next week... The problem is that even getting a
> half-decent disk filesystem is many years of work, and large disks are
> here before then. The ZFS code took 10 years to get to its current state,
> I understand, so I don't anticipate we will get there overnight.
I helped bring up the first instance of ZFS running as a kernel module
on Halloween, 2002 (one fun week staying up all night hacking with
Jeff Bonwick). The earliest code was written in either 2001 or just
possibly 2000 - so 5-6 years in elapsed time. On the other hand, in
terms of total programmer staff-years put into ZFS, it's on the order
of 25 years.
I'm not sure either what the best route to the next big Linux file
system is - start from scratch or reuse a lot of code. One of the
things I want to talk about at the workshop is creative reuse of
existing code, a la the continuation inode idea.
-VAL
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 22:15 ` Andrew Morton
2006-06-09 23:11 ` Andreas Dilger
@ 2006-06-10 3:49 ` Nathan Scott
1 sibling, 0 replies; 295+ messages in thread
From: Nathan Scott @ 2006-06-10 3:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel, linux-kernel
On Fri, Jun 09, 2006 at 03:15:53PM -0700, Andrew Morton wrote:
> Sonny Rao <sonny@burdell.org> wrote:
> > On Fri, Jun 09, 2006 at 10:35:43AM -0700, Andrew Morton wrote:
> > <snip>
> > > All that being said, Linux's filesystems are looking increasingly crufty
> > > and we are getting to the time where we would benefit from a greenfield
> > > start-a-new-one.
> >
> > I'm curious about this comment; in what way are they _collectively_
> > looking crufty ?
>
> We seem to be lagging behind "the industry" in some areas - handling large
> devices, high bandwidth IO, sophisticated on-disk data structures, advanced
> manageability, etc.
Er, no. I'm not aware of many filesystems that are in the same
league as XFS on those first three specific points. It certainly
has "ondisk sophistication" very well covered, trust me. ;)
We are definitely not lagging on handling large devices nor high
bandwidth I/O anyway - XFS serves up very close to the hardware
capabilities for high end hardware and it scales well. One could
come up with a different list of areas where Linux filesystems
might be lagging, but that list above ain't right.
> I mean, although ZFS is a rampant layering violation and we can do a lot of
> the things in there (without doing it all in the fs!) I don't think we can
> do all of it.
*nod*.
cheers.
--
Nathan
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:40 ` Jeff Garzik
2006-06-09 15:42 ` Matthew Wilcox
2006-06-09 16:56 ` Andrew Morton
@ 2006-06-09 18:23 ` Michael Poole
2006-06-09 18:55 ` Jeff Garzik
2006-06-10 0:49 ` Sven-Haegar Koch
2 siblings, 2 replies; 295+ messages in thread
From: Michael Poole @ 2006-06-09 18:23 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
linux-fsdevel
Jeff Garzik writes:
> Andrew Morton wrote:
> > Ted&co have been pretty good at avoiding compatibility problems.
>
> Well, extents and 48bit make that track record demonstrably worse.
>
> Users are now forced to remember that, if they write to their
> filesystem after using either $mmver or $korgver kernels, they are
> locked out of using older kernels.
Users are also forced to remember that, if they use certain new
distros or programs, they are locked out of using older kernels. They
are forced to remember that if they have certain newer hardware, they
are locked out of using older kernels. They are forced to remember
that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
using older kernels. Why single out this particular aspect of limited
forward compatibility to harp on so much?
Michael Poole
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:23 ` Michael Poole
@ 2006-06-09 18:55 ` Jeff Garzik
2006-06-09 19:42 ` [Ext2-devel] " Gerrit Huizenga
2006-06-10 0:49 ` Sven-Haegar Koch
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:55 UTC (permalink / raw)
To: Michael Poole
Cc: Andrew Morton, ext2-devel, linux-kernel, Christoph Hellwig, cmm,
linux-fsdevel
Michael Poole wrote:
> Jeff Garzik writes:
>
>> Andrew Morton wrote:
>>> Ted&co have been pretty good at avoiding compatibility problems.
>> Well, extents and 48bit make that track record demonstrably worse.
>>
>> Users are now forced to remember that, if they write to their
>> filesystem after using either $mmver or $korgver kernels, they are
>> locked out of using older kernels.
>
> Users are also forced to remember that, if they use certain new
> distros or programs, they are locked out of using older kernels. They
> are forced to remember that if they have certain newer hardware, they
> are locked out of using older kernels. They are forced to remember
> that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
> using older kernels. Why single out this particular aspect of limited
> forward compatibility to harp on so much?
Because it's called backwards compat, when it isn't?
Because it is very difficult to find out which set of kernels you are
locked out of?
Because the filesystem upgrade is stealthy, occurring as it does on the
first data write?
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 18:55 ` Jeff Garzik
@ 2006-06-09 19:42 ` Gerrit Huizenga
2006-06-09 20:00 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Gerrit Huizenga @ 2006-06-09 19:42 UTC (permalink / raw)
To: Jeff Garzik
Cc: Michael Poole, Andrew Morton, ext2-devel, linux-kernel,
Christoph Hellwig, cmm, linux-fsdevel
On Fri, 09 Jun 2006 14:55:56 EDT, Jeff Garzik wrote:
>
> Because it's called backwards compat, when it isn't?
> Because it is very difficult to find out which set of kernels you are
> locked out of?
> Because the filesystem upgrade is stealthy, occurring as it does on the
> first data write?
Actually, the *only* point being contended here is running older
kernels on some newer filesystems (created originally with a newer
kernel), right?
Or do you have examples of where current kernels could not deal
with an ext3 feature at some point in time?
I would argue that 0.001% of all Linux *users* actually worry about
this - most of them are right here on the development mailing list.
So, that group is more vocal, for sure. But, if it works for 99.99+%
users, aren't we still on the good path, from the point of view of
those people who actually *use* Linux the most?
gerrit
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 19:42 ` [Ext2-devel] " Gerrit Huizenga
@ 2006-06-09 20:00 ` Jeff Garzik
2006-06-09 20:08 ` Alex Tomas
2006-06-09 20:35 ` Theodore Tso
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:00 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Michael Poole, Andrew Morton, ext2-devel, linux-kernel,
Christoph Hellwig, cmm, linux-fsdevel
Gerrit Huizenga wrote:
> On Fri, 09 Jun 2006 14:55:56 EDT, Jeff Garzik wrote:
>> Because it's called backwards compat, when it isn't?
>> Because it is very difficult to find out which set of kernels you are
>> locked out of?
>> Because the filesystem upgrade is stealthy, occurring as it does on the
>> first data write?
>
> Actually, the *only* point being contended here is running older
> kernels on some newer filesystems (created originally with a newer
> kernel), right?
>
> Or do you have examples of where current kernels could not deal
> with an ext3 feature at some point in time?
>
> I would argue that 0.001% of all Linux *users* actually worry about
> this - most of them are right here on the development mailing list.
> So, that group is more vocal, for sure. But, if it works for 99.99+%
> users, aren't we still on the good path, from the point of view of
> those people who actually *use* Linux the most?
The overall objection is to treating ext3 as a highly mutable,
one-size-fits-all filesystem.
Maybe there is value in moving some reiser4 concepts -- a set of
metadata+algorithm plugins -- to the VFS level. I dunno.
But for ext3 specifically, it seems like bolting on extents, 48bit,
delayed allocation, and other new features weren't really suited for the
original ext2-style design. Outside of the support (and marketing,
because that's all version numbers are in the end) issues already
mentioned, I think it falls into the nebulous realm of "taste."
Rather than taking another decade to slowly fix ext2 design decisions,
why not move the process along a bit more rapidly? Release early,
release often...
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:00 ` Jeff Garzik
@ 2006-06-09 20:08 ` Alex Tomas
2006-06-09 20:10 ` [Ext2-devel] " Jeff Garzik
2006-06-09 20:35 ` Theodore Tso
1 sibling, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 20:08 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel
>>>>> Jeff Garzik (JG) writes:
JG> Rather than taking another decade to slowly fix ext2 design decisions,
JG> why not move the process along a bit more rapidly? Release early,
JG> release often...
that could be true, if we were talking about something yet to be
designed, coded and tested.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:08 ` Alex Tomas
@ 2006-06-09 20:10 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 20:10 UTC (permalink / raw)
To: Alex Tomas
Cc: Gerrit Huizenga, Andrew Morton, ext2-devel, linux-kernel,
Michael Poole, Christoph Hellwig, cmm, linux-fsdevel
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> Rather than taking another decade to slowly fix ext2 design decisions,
> JG> why not move the process along a bit more rapidly? Release early,
> JG> release often...
>
> that could be true, if we were talking about something yet to be
> designed, coded and tested.
'cp ext3 ext4' already has its first two features: extents and 48bit.
And it works today. Tested to the extent that the submitter has tested it.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 20:00 ` Jeff Garzik
2006-06-09 20:08 ` Alex Tomas
@ 2006-06-09 20:35 ` Theodore Tso
2006-06-09 21:41 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 20:35 UTC (permalink / raw)
To: Jeff Garzik
Cc: Gerrit Huizenga, Michael Poole, Andrew Morton, ext2-devel,
linux-kernel, Christoph Hellwig, cmm, linux-fsdevel
On Fri, Jun 09, 2006 at 04:00:44PM -0400, Jeff Garzik wrote:
> But for ext3 specifically, it seems like bolting on extents, 48bit,
> delayed allocation, and other new features weren't really suited for the
> original ext2-style design. Outside of the support (and marketing,
> because that's all version numbers are in the end) issues already
> mentioned, I think it falls into the nebulous realm of "taste."
If it is very much a matter of taste, why are you trying to dictate to
the ext2 developers how they choose to do things? As long as it
works, and we haven't screwed up yet, I'd argue this falls into the
category of letting each subsystem decide how they best work. The way
DaveM and the networking team work is quite different from how the
SCSI developers work or the XFS team works --- it's not a
one-size-fits-all sort of thing.
And I'd also dispute your "weren't really suited for the original
ext2-style design" comment. Ext2/3 was always designed to be
extensible from the start, and we've added features quite
successfully for quite a while.
> Rather than taking another decade to slowly fix ext2 design decisions,
> why not move the process along a bit more rapidly? Release early,
> release often...
I don't think it will be another decade, but yes, regardless of
whether we do a code fork or not, it will take time. Basically, you
and the ext2 developers have a disagreement about whether or not a
code fork will actually move the process along more quickly or not.
Either way, we will be releasing early and often, so people can test
it out and comment on it. Releasing patches to LKML is just the first
step in this process.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 20:35 ` Theodore Tso
@ 2006-06-09 21:41 ` Jeff Garzik
2006-06-09 21:45 ` [Ext2-devel] " Michael Poole
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:41 UTC (permalink / raw)
To: Theodore Tso
Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel
Theodore Tso wrote:
> And I'd also dispute your "weren't really suited for the original
> ext2-style design" comment. Ext2/3 was always designed to be
> extensible from the start, and we've added features quite
> successfully for quite a while.
Although not the only disk format change, extents are a pretty big one.
Will this be the last major on-disk format change?
>> Rather than taking another decade to slowly fix ext2 design decisions,
>> why not move the process along a bit more rapidly? Release early,
>> release often...
>
> I don't think it will be another decade, but yes, regardless of
> whether we do a code fork or not, it will take time. Basically, you
> and the ext2 developers have a disagreement about whether or not a
> code fork will actually move the process along more quickly or not.
> Either way, we will be releasing early and often, so people can test
> it out and comment on it. Releasing patches to LKML is just the first
> step in this process.
I don't see how a larger filesystem codebase could possibly move more
quickly than a smaller codebase. You'd have twice as many code paths to
worry about.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 21:41 ` Jeff Garzik
@ 2006-06-09 21:45 ` Michael Poole
2006-06-09 21:53 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Michael Poole @ 2006-06-09 21:45 UTC (permalink / raw)
To: Jeff Garzik
Cc: Theodore Tso, Gerrit Huizenga, Andrew Morton, ext2-devel,
linux-kernel, Christoph Hellwig, cmm, linux-fsdevel
Jeff Garzik writes:
> Theodore Tso wrote:
> > And I'd also dispute your "weren't really suited for the original
> > ext2-style design" comment. Ext2/3 was always designed to be
> > extensible from the start, and we've added features quite
> > successfully for quite a while.
>
> Although not the only disk format change, extents are a pretty big
> one. Will this be the last major on-disk format change?
You keep making "straw that broke the camel's back" type arguments
without saying why this particular straw (rather than the other
compatibility-breaking features that are already in ext3) is the one
that must not be allowed. Is it a matter of taste, or is there some
objective threshold that extents cross?
> >> Rather than taking another decade to slowly fix ext2 design
> >> decisions, why not move the process along a bit more rapidly?
> >> Release early, release often...
> > I don't think it will be another decade, but yes, regardless of
> > whether we do a code fork or not, it will take time. Basically, you
> > and the ext2 developers have a disagreement about whether or not a
> > code fork will actually move the process along more quickly or not.
> > Either way, we will be releasing early and often, so people can test
> > it out and comment on it. Releasing patches to LKML is just the first
> > step in this process.
>
> I don't see how a larger filesystem codebase could possibly move more
> quickly than a smaller codebase. You'd have twice as many code paths
> to worry about.
This is also the case when you cut and paste an entire filesystem's
source code, as has been mentioned several times in this thread.
Michael Poole
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 21:45 ` [Ext2-devel] " Michael Poole
@ 2006-06-09 21:53 ` Jeff Garzik
2006-06-09 22:04 ` Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 21:53 UTC (permalink / raw)
To: Michael Poole
Cc: Theodore Tso, Gerrit Huizenga, Andrew Morton, ext2-devel,
linux-kernel, Christoph Hellwig, cmm, linux-fsdevel
Michael Poole wrote:
> Jeff Garzik writes:
>
>> Theodore Tso wrote:
>>> And I'd also dispute your "weren't really suited for the original
>>> ext2-style design" comment. Ext2/3 was always designed to be
>>> extensible from the start, and we've added features quite
>>> successfully for quite a while.
>> Although not the only disk format change, extents are a pretty big
>> one. Will this be the last major on-disk format change?
>
> You keep making "straw that broke the camel's back" type arguments
> without saying why this particular straw (rather than the other
> compatibility-breaking features that are already in ext3) is the one
> that must not be allowed. Is it a matter of taste, or is there some
> objective threshold that extents cross?
Yes, it's not a small change to the on-disk format.
If you write tools that read an ext3 filesystem, you won't be able to
read file data at all, without updating your code.
That's a much bigger deal than say 32-bit uids.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 21:53 ` Jeff Garzik
@ 2006-06-09 22:04 ` Theodore Tso
0 siblings, 0 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-09 22:04 UTC (permalink / raw)
To: Jeff Garzik
Cc: Andrew Morton, ext2-devel, linux-kernel, Michael Poole,
Christoph Hellwig, Gerrit Huizenga, cmm, linux-fsdevel
On Fri, Jun 09, 2006 at 05:53:14PM -0400, Jeff Garzik wrote:
> Yes, it's not a small change to the on-disk format.
>
> If you write tools that read an ext3 filesystem, you won't be able to
> read file data at all, without updating your code.
Most tools that read an ext2/3 filesystem directly use the libext2fs
library, and it will definitely be the case that for files smaller
than 4TB, even on a filesystem with extents enabled, as long as you
are using a version of libext2fs which is extents-aware, it will work
without any changes.
For files larger than 4TB, we will need some kind of LFS-like
interface change (i.e., ext2fs_file_llseek64 vs. ext2fs_file_llseek),
but that should be the only change needed by the tool.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 18:23 ` Michael Poole
2006-06-09 18:55 ` Jeff Garzik
@ 2006-06-10 0:49 ` Sven-Haegar Koch
2006-06-10 1:06 ` Theodore Tso
1 sibling, 1 reply; 295+ messages in thread
From: Sven-Haegar Koch @ 2006-06-10 0:49 UTC (permalink / raw)
To: Michael Poole
Cc: Jeff Garzik, Andrew Morton, Christoph Hellwig, cmm, linux-kernel,
ext2-devel, linux-fsdevel
On Fri, 9 Jun 2006, Michael Poole wrote:
> Jeff Garzik writes:
>
>> Andrew Morton wrote:
>>> Ted&co have been pretty good at avoiding compatibility problems.
>>
>> Well, extents and 48bit make that track record demonstrably worse.
>>
>> Users are now forced to remember that, if they write to their
>> filesystem after using either $mmver or $korgver kernels, they are
>> locked out of using older kernels.
>
> Users are also forced to remember that, if they use certain new
> distros or programs, they are locked out of using older kernels. They
> are forced to remember that if they have certain newer hardware, they
> are locked out of using older kernels. They are forced to remember
> that if they use ext3 (or XFS or JFS) _at all_ they are locked out of
> using older kernels. Why single out this particular aspect of limited
> forward compatibility to harp on so much?
I see a different problem with "ext3 + extents is not ext3 anymore" when
the feature goes mainstream:
- user with old distro, no extents in use, no kernel support for them
- user has some kind of problem
- uses new rescue disk (aka knoppix at the time of problem) - that then
is current stuff, and certainly uses extents - fixes problem on disk
(may be as simple as running lilo/grub from chroot, happens often for me)
- tries to boot back into his distro -> *boom* he lost
c'ya
sven
--
The Internet treats censorship as a routing problem, and routes around it.
(John Gilmore on http://www.cygnus.com/~gnu/)
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 0:49 ` Sven-Haegar Koch
@ 2006-06-10 1:06 ` Theodore Tso
2006-06-10 14:07 ` Olivier Galibert
0 siblings, 1 reply; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 1:06 UTC (permalink / raw)
To: Sven-Haegar Koch
Cc: Andrew Morton, Jeff Garzik, ext2-devel, linux-kernel,
Michael Poole, Christoph Hellwig, cmm, linux-fsdevel
On Sat, Jun 10, 2006 at 02:49:32AM +0200, Sven-Haegar Koch wrote:
> I see a different problem with "ext3 + extents is not ext3 anymore" when
> the feature goes mainstream:
> - user with old distro, no extents in use, no kernel support for them
> - user has some kind of problem
> - uses new rescue disk (aka knoppix at the time of problem) - that then
> is current stuff, and certainly uses extents - fixes problem on disk
> (may be as simple as running lilo/grub from chroot, happens often for me)
> - tries to boot back into his distro -> *boom* he lost
Incorrect, because unless you explicitly enable the use of extents,
the mere act of using a new kernel such as might be found on knoppix
will not result in the filesystem utilizing the extent feature.
There's a lot of FUD being spread by people who haven't bothered to
understand what is being proposed, and that's disappointing.
- Ted
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 1:06 ` Theodore Tso
@ 2006-06-10 14:07 ` Olivier Galibert
2006-06-10 19:52 ` Theodore Tso
0 siblings, 1 reply; 295+ messages in thread
From: Olivier Galibert @ 2006-06-10 14:07 UTC (permalink / raw)
To: Theodore Tso, Sven-Haegar Koch, Michael Poole, Jeff Garzik,
Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
linux-fsdevel
On Fri, Jun 09, 2006 at 09:06:51PM -0400, Theodore Tso wrote:
> On Sat, Jun 10, 2006 at 02:49:32AM +0200, Sven-Haegar Koch wrote:
> > I see a different problem with "ext3 + extents is not ext3 anymore" when
> > the feature goes mainstream:
> > - user with old distro, no extents in use, no kernel support for them
> > - user has some kind of problem
> > - uses new rescue disk (aka knoppix at the time of problem) - that then
> > is current stuff, and certainly uses extents - fixes problem on disk
> > (may be as simple as running lilo/grub from chroot, happens often for me)
> > - tries to boot back into his distro -> *boom* he lost
>
> Incorrect, because unless you explicitly enable the use of extents,
> the mere act of using a new kernel such as might be found on knoppix
> will not result in the filesystem utilizing the extent feature.
And how shall the rescue/live CD know whether to use the feature?
OG.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-10 14:07 ` Olivier Galibert
@ 2006-06-10 19:52 ` Theodore Tso
0 siblings, 0 replies; 295+ messages in thread
From: Theodore Tso @ 2006-06-10 19:52 UTC (permalink / raw)
To: Olivier Galibert, Sven-Haegar Koch, Michael Poole, Jeff Garzik,
Andrew Morton, Christoph Hellwig, cmm, linux-kernel, ext2-devel,
linux-fsdevel
On Sat, Jun 10, 2006 at 04:07:14PM +0200, Olivier Galibert wrote:
> > Incorrect, because unless you explicitly enable the use of extents,
> > the mere act of using a new kernel such as might be found on knoppix
> > will not result in the filesystem utilizing the extent feature.
>
> And how shall the rescue/live CD know whether to use the feature?
Because there will be a bit in the superblock that the user will have to
explicitly enable in order to get extents, so a new kernel on the
rescue/live CD will know whether or not extents are allowed --- just as
today, you have to explicitly enable hashed tree directory indexing
with the command, tune2fs -O dir_index /dev/hdXXX.
- Ted
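
As a rough illustration of the opt-in described above (not the code from the patch series), the gate can be modelled as a single bit test on the superblock's incompatible-feature word. The constant values below follow the ext2/ext3 on-disk feature flags as commonly documented, and the function names are hypothetical, chosen only for this sketch.

```python
# Sketch only. The bit values mirror the ext2/ext3 on-disk feature flags as
# commonly documented; the function names are hypothetical.
EXT2_FEATURE_INCOMPAT_FILETYPE = 0x0002  # directory entries record file type
EXT3_FEATURE_INCOMPAT_EXTENTS = 0x0040   # extent-mapped files (the proposed bit)


def extents_allowed(s_feature_incompat: int) -> bool:
    """True only if the extents bit has been set in the superblock."""
    return bool(s_feature_incompat & EXT3_FEATURE_INCOMPAT_EXTENTS)


def admin_enables_extents(s_feature_incompat: int) -> int:
    """Models an explicit, tune2fs-style opt-in: returns the updated word."""
    return s_feature_incompat | EXT3_FEATURE_INCOMPAT_EXTENTS


# A filesystem made by an old distro: the extents bit is clear.
old_fs = EXT2_FEATURE_INCOMPAT_FILETYPE

# An extents-capable rescue kernel still leaves the on-disk format alone.
print(extents_allowed(old_fs))                         # False
# Only after an explicit opt-in would new files be extent-mapped.
print(extents_allowed(admin_enables_extents(old_fs)))  # True
```

Under this model, booting a newer rescue CD changes nothing on disk unless someone deliberately sets the bit, which is the point being made about dir_index today.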
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 9:13 ` Christoph Hellwig
2006-06-09 10:07 ` Andrew Morton
@ 2006-06-09 10:49 ` Andreas Dilger
2006-06-09 11:26 ` Alex Tomas
2 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 10:49 UTC (permalink / raw)
To: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
linux-fsdevel
On Jun 09, 2006 10:13 +0100, Christoph Hellwig wrote:
> the block numbers aren't the big problem concerning scalability, there's
> a lot more to it, like btree(-like) structures in the allocator, parallel
> allocator algorithms and a better allocation group concept.
All of the allocator changes are already written and well tested, and gave
ext3 a 30% performance improvement while at the same time reducing CPU
usage by 50% - not trivial. See Holger Kiehl's post
http://marc.theaimsgroup.com/?l=linux-kernel&m=114958967600822&w=4
> If you guys want big storage on linux please help improving the filesystems
> design for that, e.g. jfs or xfs instead of shoehorning it onto ext3 thus
> both making ext3 less reliable for us desktop/small server users and not get
> the full thing for the big storage people either.
XFS = 108844 lines, and a complete mess to understand.
ext3+jbd = 27749 lines (includes ~6000 lines extent/allocation changes)
Also, ext3 is just much more robust in the face of on-disk corruption
than xfs or jfs because of its "static" layout, and e2fsck is way
better than the alternatives. Despite their "big storage" designs ext3
is still competitive in performance, especially with the allocation
improvements. See also:
http://marc.theaimsgroup.com/?l=ext2-devel&m=108194477207334&w=4
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=110112879929869&w=4
http://samba.org/~tridge/xattr_results/xfs-ext3-tuning.png
If the XFS or JFS maintainers want to fix their filesystems, they are free
to do so; the ext3 maintainers (all of them, btw) want these changes.
The extent code (prior to some minor cleanups for landing on the vanilla
kernel) has already seen many millions of hours of testing in very heavy
IO environments, so it isn't something that was just written. If extents
aren't enabled it amounts to a couple of extra conditionals in the
allocation path and basically no modifications to the existing code, so
you can safely avoid this code if you feel the need to.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 9:13 ` Christoph Hellwig
2006-06-09 10:07 ` Andrew Morton
2006-06-09 10:49 ` Andreas Dilger
@ 2006-06-09 11:26 ` Alex Tomas
2006-06-09 14:23 ` [Ext2-devel] " Jeff Garzik
2 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 11:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, ext2-devel, Mingming Cao, linux-kernel
>>>>> Christoph Hellwig (CH) writes:
CH> If you guys want big storage on linux please help improving the filesystems
CH> design for that, e.g. jfs or xfs instead of shoehorning it onto ext3 thus
CH> both making ext3 less reliable for us desktop/small server users and not get
CH> the full thing for the big storage people either.
proposed patches don't touch existing code paths.
extents may be enabled/disabled on per-file basis.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 11:26 ` Alex Tomas
@ 2006-06-09 14:23 ` Jeff Garzik
2006-06-09 14:33 ` Alex Tomas
2006-06-09 14:34 ` Alex Tomas
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 14:23 UTC (permalink / raw)
To: Alex Tomas
Cc: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
linux-fsdevel
Alex Tomas wrote:
>>>>>> Christoph Hellwig (CH) writes:
>
> CH> If you guys want big storage on linux please help improving the filesystems
> CH> design for that, e.g. jfs or xfs instead of shoehorning it onto ext3 thus
> CH> both making ext3 less reliable for us desktop/small server users and not get
> CH> the full thing for the big storage people either.
>
> proposed patches don't touch existing code paths.
> extents may be enabled/disabled on per-file basis.
And thus, inodes are progressively incompatible with older kernels.
Boot into an older kernel, and you can now only read half your
filesystem (if it even allows mount at all).
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 14:23 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 14:33 ` Alex Tomas
2006-06-09 14:34 ` Alex Tomas
1 sibling, 0 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 14:33 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
ext2-devel, linux-fsdevel
>>>>> Jeff Garzik (JG) writes:
JG> And thus, inodes are progressively incompatible with older
JG> kernels. Boot into an older kernel, and you can now only read half
JG> your filesystem (if it even allows mount at all).
nope, an ext3 that doesn't have the feature compiled in won't let
you mount an fs with extents-enabled files. the same will
happen if you call it ext4.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 14:23 ` [Ext2-devel] " Jeff Garzik
2006-06-09 14:33 ` Alex Tomas
@ 2006-06-09 14:34 ` Alex Tomas
2006-06-09 14:35 ` Jeff Garzik
1 sibling, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 14:34 UTC (permalink / raw)
To: Jeff Garzik
Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
linux-fsdevel, Alex Tomas
>>>>> Jeff Garzik (JG) writes:
JG> And thus, inodes are progressively incompatible with older
JG> kernels. Boot into an older kernel, and you can now only read half
JG> your filesystem (if it even allows mount at all).
nope, an ext3 that doesn't have the feature compiled in won't let
you mount an fs with extents-enabled files. the same will
happen if you call it ext4.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 14:34 ` Alex Tomas
@ 2006-06-09 14:35 ` Jeff Garzik
2006-06-09 14:57 ` Alex Tomas
0 siblings, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 14:35 UTC (permalink / raw)
To: Alex Tomas
Cc: Christoph Hellwig, linux-fsdevel, ext2-devel, Mingming Cao,
linux-kernel
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> And thus, inodes are progressively incompatible with older
> JG> kernels. Boot into an older kernel, and you can now only read half
> JG> your filesystem (if it even allows mount at all).
>
> nope, an ext3 that doesn't have the feature compiled in won't let
> you mount an fs with extents-enabled files. the same will
> happen if you call it ext4.
This is my point... why increase user confusion by calling it ext3, then?
Extents magnify the "what ext3 filesystem am I talking to, today?" problem.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 14:35 ` Jeff Garzik
@ 2006-06-09 14:57 ` Alex Tomas
2006-06-09 15:17 ` [Ext2-devel] " Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 14:57 UTC (permalink / raw)
To: Jeff Garzik
Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
linux-fsdevel, Alex Tomas
>>>>> Jeff Garzik (JG) writes:
JG> Alex Tomas wrote:
>>>>>>> Jeff Garzik (JG) writes:
JG> And thus, inodes are progressively incompatible with older
JG> kernels. Boot into an older kernel, and you can now only read half
JG> your filesystem (if it even allows mount at all).
>> nope, an ext3 that doesn't have the feature compiled in won't let
>> you mount an fs with extents-enabled files. the same will
>> happen if you call it ext4.
JG> This is my point... why increase user confusion by calling it ext3, then?
by default it's still good old ext3 without extents. user should
enable it explicitly. for him, this means the feature is ready
to be used anytime. the only thing he needs is to (re)mount fs
with the option. for us, this means: a) a single source tree -
easy to maintain b) we must be clear with user that the feature
isn't backward compatible
thanks, Alex
PS. in the end this is just ext3 with one more feature ...
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 14:57 ` Alex Tomas
@ 2006-06-09 15:17 ` Jeff Garzik
2006-06-09 16:21 ` Mike Snitzer
2006-06-09 16:56 ` Andreas Dilger
0 siblings, 2 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 15:17 UTC (permalink / raw)
To: Alex Tomas
Cc: Christoph Hellwig, Mingming Cao, linux-kernel, ext2-devel,
linux-fsdevel
Alex Tomas wrote:
> PS. in the end this is just ext3 with one more feature ...
Incorrect. You have to look at ext3 development over time. This is a
PATTERN with ext3 development: mutating the metadata over time in a
progressively incompatible manner.
You have this thing called "ext3", which fools an admin into thinking
they can use their filesystem with any kernel that has "ext3" support.
That's somewhat true today, but with extents it will become false.
Having a mutating definition of "ext3" is a convenience for developers,
and for users WHO ONLY MOVE FORWARD in kernel versions.
A 48bit ext3 filesystem with extents is completely unusable in 2.4.30's
"ext3" or 2.6.10's "ext3". Users are forced to hunt down the specific
kernel version when an incompatible feature was added to ext3. How can
that possibly be described as "user friendly"?
"Which ext3 am I talking to, today?"
"And which kernels am I locked into, in order to talk to my filesystem?"
Not all users are big production houses that plan their filesystem
metadata migration months in advance! I _guarantee_ some users will
boot into ext3-with-extents, use it for a while, and then try to
downgrade for whatever reason... only to find they have been LOCKED
OUT. That is a very real world situation, guys.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 15:17 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 16:21 ` Mike Snitzer
2006-06-09 16:27 ` Jeff Garzik
2006-06-09 16:33 ` Alex Tomas
2006-06-09 16:56 ` Andreas Dilger
1 sibling, 2 replies; 295+ messages in thread
From: Mike Snitzer @ 2006-06-09 16:21 UTC (permalink / raw)
To: Jeff Garzik
Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
ext2-devel, linux-fsdevel
On 6/9/06, Jeff Garzik <jeff@garzik.org> wrote:
> Alex Tomas wrote:
> > PS. in the end this is just ext3 with one more feature ...
>
> Incorrect. You have to look at ext3 development over time. This is a
> PATTERN with ext3 development: mutating the metadata over time in a
> progressively incompatible manner.
>
> You have this thing called "ext3", which fools an admin into thinking
> they can use their filesystem with any kernel that has "ext3" support.
> That's somewhat true today, but with extents it will become false.
> Having a mutating definition of "ext3" is a convenience for developers,
> and for users WHO ONLY MOVE FORWARD in kernel versions.
>
> A 48bit ext3 filesystem with extents is completely unusable in 2.4.30's
> "ext3" or 2.6.10's "ext3". Users are forced to hunt down the specific
> kernel version when an incompatible feature was added to ext3. How can
> that possibly be described as "user friendly"?
>
> "Which ext3 am I talking to, today?"
> "And which kernels am I locked into, in order to talk to my filesystem?"
>
> Not all users are big production houses that plan their filesystem
> metadata migration months in advance! I _guarantee_ some users will
> boot into ext3-with-extents, use it for a while, and then try to
> downgrade for whatever reason... only to find they have been LOCKED
> OUT. That is a very real world situation, guys.
Jeff,
I think all of us do understand what you're saying and on some level
are willing to accept that ext3-with-extents is in fact worthy of
branching to ext4, hence the url that has hosted the development of
extents (mballoc, delalloc, 48bit etc):
http://www.bullopensource.org/ext4/
But it _seems_ you're trying to paint ALL the ext3-developers as a
narrow minded lot. If and when users decide to enable ext3 extents on
their filesystems they will presumably understand that doing so
precludes their ability to boot older kernels (steps can be taken to
make them well aware of this). The "real world situation" you refer
to, while hypothetically valid, isn't something informed
ext3-with-extents users will _ever_ elect to do.
Once a compelling feature is introduced Linux users embrace it and
never look back (provided it is stable!). The real risk is the
(in)stability of all these ext3 improvements. Stability is obviously
a requirement for merging these changes but I for one find it
refreshing that the current desire is to merge extents with ext3
(implicitly speaks to its stability when you couple that desire with
the fact that so many ext3 stakeholders are onboard!).
And as an aside, merging extents with ext3 forces ext3-developers to
be somewhat conservative about what bells and whistles they'd be
introducing moving forward. The worst thing would be for these ext3
improvements to get merged into a new ext4 that becomes widely known
as "the experimental ext3++". I suppose developer discipline would
prevent such an unfortunate distinction but a new ext4 sandbox _could_
open the flood gates.
Developers never _want_ to branch (maintenance-hell), the question
becomes: do the risks associated with ext3-with-extents' backward
incompatibility _really_ justify the branch?
Mike
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:21 ` Mike Snitzer
@ 2006-06-09 16:27 ` Jeff Garzik
2006-06-09 16:48 ` Alex Tomas
2006-06-09 16:33 ` Alex Tomas
1 sibling, 1 reply; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:27 UTC (permalink / raw)
To: Mike Snitzer
Cc: Alex Tomas, Christoph Hellwig, Mingming Cao, linux-kernel,
ext2-devel, linux-fsdevel
Mike Snitzer wrote:
> Developers never _want_ to branch (maintenance-hell), the question
> becomes: do the risks associated with ext3-with-extents' backward
> incompatibility _really_ justify the branch?
It's also a question of... why keep adding modernizing features to
ext3, thus keeping it on life support, but just barely? If we are going
to modernize the _main Linux filesystem_, let's not do it in a way that
is slow, and ties our hands.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:27 ` Jeff Garzik
@ 2006-06-09 16:48 ` Alex Tomas
2006-06-09 16:51 ` Jeff Garzik
0 siblings, 1 reply; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:48 UTC (permalink / raw)
To: Jeff Garzik
Cc: Mike Snitzer, Alex Tomas, Christoph Hellwig, Mingming Cao,
linux-kernel, ext2-devel, linux-fsdevel
>>>>> Jeff Garzik (JG) writes:
JG> It's also a question of... why keep adding modernizing features to
JG> ext3, thus keeping it on life support, but just barely? If we are
JG> going to modernize the _main Linux filesystem_, let's not do it in a
JG> way that is slow, and ties our hands.
I think trying to solve all problems at once will take much longer.
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:48 ` Alex Tomas
@ 2006-06-09 16:51 ` Jeff Garzik
0 siblings, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:51 UTC (permalink / raw)
To: Alex Tomas
Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
linux-fsdevel
Alex Tomas wrote:
>>>>>> Jeff Garzik (JG) writes:
>
> JG> It's also a question of... why keep adding modernizing features to
> JG> ext3, thus keeping it on life support, but just barely? If we are
> JG> going to modernize the _main Linux filesystem_, let's not do it in a
> JG> way that is slow, and ties our hands.
>
> I think trying to solve all problems at once will take much longer.
I guess it's a good thing that real world development never works like that.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:21 ` Mike Snitzer
2006-06-09 16:27 ` Jeff Garzik
@ 2006-06-09 16:33 ` Alex Tomas
2006-06-09 16:37 ` [Ext2-devel] " Jeff Garzik
2006-06-09 22:52 ` Valdis.Kletnieks
1 sibling, 2 replies; 295+ messages in thread
From: Alex Tomas @ 2006-06-09 16:33 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jeff Garzik, ext2-devel, linux-kernel, Christoph Hellwig,
Mingming Cao, linux-fsdevel, Alex Tomas
>>>>> Mike Snitzer (MS) writes:
MS> precludes their ability to boot older kernels (steps can be taken to
MS> make them well aware of this). The "real world situation" you refer
MS> to, while hypothetically valid, isn't something informed
MS> ext3-with-extents users will _ever_ elect to do.
one who needs/wants to go back may get rid of extents by:
a) remounting w/o extents option
b) copying new-fashion-style files so that copies use blockmap
c) dropping extents feature in superblock
thanks, Alex
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:33 ` Alex Tomas
@ 2006-06-09 16:37 ` Jeff Garzik
2006-06-09 22:52 ` Valdis.Kletnieks
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 16:37 UTC (permalink / raw)
To: Alex Tomas
Cc: Mike Snitzer, Christoph Hellwig, Mingming Cao, linux-kernel,
ext2-devel, linux-fsdevel
Alex Tomas wrote:
>>>>>> Mike Snitzer (MS) writes:
>
> MS> precludes their ability to boot older kernels (steps can be taken to
> MS> make them well aware of this). The "real world situation" you refer
> MS> to, while hypothetically valid, isn't something informed
> MS> ext3-with-extents users will _ever_ elect to do.
>
> one who needs/wants to go back may get rid of extents by:
> a) remounting w/o extents option
> b) copying new-fashion-style files so that copies use blockmap
> c) dropping extents feature in superblock
More likely, they will just backup+restore rather than go through all that.
After leafing through a 50-page manual to match up kernel versions with
ext3 features, to see which older kernels will (or won't) require all
this work.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:33 ` Alex Tomas
2006-06-09 16:37 ` [Ext2-devel] " Jeff Garzik
@ 2006-06-09 22:52 ` Valdis.Kletnieks
2006-06-09 23:21 ` Andreas Dilger
1 sibling, 1 reply; 295+ messages in thread
From: Valdis.Kletnieks @ 2006-06-09 22:52 UTC (permalink / raw)
To: Alex Tomas
Cc: Mike Snitzer, Jeff Garzik, Christoph Hellwig, Mingming Cao,
linux-kernel, ext2-devel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 565 bytes --]
On Fri, 09 Jun 2006 20:33:18 +0400, Alex Tomas said:
> one who needs/wants to go back may get rid of extents by:
> a) remounting w/o extents option
> b) copying new-fashion-style files so that copies use blockmap
> c) dropping extents feature in superblock
OK.. Obviously my brain is tiny and easily overfilled.
Given that the whole alleged problem with extents is that they're not
backward compatible, how do you read the files in (b) so that you can copy
them, if the data is in the non-compatible extents that you can't read because
you've disabled extents?
[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 22:52 ` Valdis.Kletnieks
@ 2006-06-09 23:21 ` Andreas Dilger
2006-06-10 1:21 ` Valdis.Kletnieks
0 siblings, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 23:21 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Alex Tomas, Jeff Garzik, ext2-devel, linux-kernel,
Christoph Hellwig, Mingming Cao, linux-fsdevel
On Jun 09, 2006 18:52 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 20:33:18 +0400, Alex Tomas said:
> > one who needs/wants to go back may get rid of extents by:
> > a) remounting w/o extents option
> > b) copying new-fashion-style files so that copies use blockmap
> > c) dropping extents feature in superblock
>
> OK.. Obviously my brain is tiny and easily overfilled.
...
> Given that the whole alleged problem with extents is that they're not
> backward compatible, how do you read the files in (b) so that you can copy
> them, if the data is in the non-compatible extents that you can't read because
> you've disabled extents?
You mount with the new kernel without "-o extents", and find files with
extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
files, mv over old files, unmount.
A similar thing is necessary for ext3 filesystems before you can mount them
as ext2 - they can't be mounted as ext2 until the journal is recovered
(an unrecovered journal is an incompatible feature).
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 23:21 ` Andreas Dilger
@ 2006-06-10 1:21 ` Valdis.Kletnieks
2006-06-10 2:09 ` [Ext2-devel] " Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Valdis.Kletnieks @ 2006-06-10 1:21 UTC (permalink / raw)
To: Andreas Dilger
Cc: Jeff Garzik, ext2-devel, linux-kernel, Christoph Hellwig,
Mingming Cao, linux-fsdevel, Alex Tomas
[-- Attachment #1.1: Type: text/plain, Size: 795 bytes --]
On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:
> You mount with the new kernel without "-o extents", and find files with
> extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> files, mv over old files, unmount.
How do you "copy those files" when you don't have extent support at that
point? Remember - the whole problem here is that if you don't have
extent support, you can't read the file, it's backward-incompatible.
(If you *are* able to read the file even without extents, then this whole
thread is total BS).
You can certainly at least try to copy them to another file system
while the source *is* mounted with -o extents, and then mount without it
and copy the files back, but (a) that isn't what you said and (b) it doesn't
work for files over 2T or so..
[-- Attachment #1.2: Type: application/pgp-signature, Size: 226 bytes --]
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 1:21 ` Valdis.Kletnieks
@ 2006-06-10 2:09 ` Andreas Dilger
2006-06-10 2:45 ` Nicholas Miell
0 siblings, 1 reply; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 2:09 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Alex Tomas, Jeff Garzik, ext2-devel, linux-kernel,
Christoph Hellwig, Mingming Cao, linux-fsdevel
On Jun 09, 2006 21:21 -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:
> > You mount with the new kernel without "-o extents", and find files with
> > extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> > files, mv over old files, unmount.
>
> How do you "copy those files" when you don't have extent support at that
> point? Remember - the whole problem here is that if you don't have
> extent support, you can't read the file, it's backward-incompatible.
> (If you *are* able to read the file even without extents, then this whole
> thread is total BS).
The "-o extents" mount option only affects new files that are created
while that option is enabled. It doesn't affect existing files (even if
they are modified while "-o extents" is set). It also doesn't affect any
new files after "-o extents" is removed. Also, directories will not
be extent-mapped, because their allocation pattern doesn't mix well with
extent-mapped files (i.e. they are mostly single-block allocations).
Files that are created with "-o extents" are of course only readable with
a kernel that supports it. To be safe, the whole filesystem is marked
with an EXT3_FEATURE_INCOMPAT_EXTENTS flag when the first extent file
is created so that users don't inadvertently get strange errors while
accessing the inodes marked with EXT3_EXTENT_FL with an old kernel.
New kernels that understand INCOMPAT_EXTENTS of course can access extent
and non-extent files equally well.
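
The superblock gate described above can be sketched in shell arithmetic.
This is only an illustration under stated assumptions, not the kernel's
code: the bit values (0x0002 filetype, 0x0004 recover, 0x0040 extents),
the function name can_mount, and the two "kernel" masks are all assumed
for the example.

```shell
#!/bin/sh
# Assumed incompat feature bits, for illustration only.
INCOMPAT_FILETYPE=$((0x0002))
INCOMPAT_RECOVER=$((0x0004))
INCOMPAT_EXTENTS=$((0x0040))

# can_mount ON_DISK SUPPORTED
# Succeeds only if every incompat bit set on disk is also present in the
# mask of features this (hypothetical) kernel understands.
can_mount() {
    unknown=$(( $1 & ~$2 ))
    [ "$unknown" -eq 0 ]
}

# Hypothetical feature masks for a kernel without and with extent support.
old_kernel=$((INCOMPAT_FILETYPE | INCOMPAT_RECOVER))
new_kernel=$((old_kernel | INCOMPAT_EXTENTS))

# A filesystem on which at least one extent-mapped file has been created.
fs_with_extents=$((INCOMPAT_FILETYPE | INCOMPAT_EXTENTS))

can_mount "$fs_with_extents" "$new_kernel" && echo "new kernel: mounts"
can_mount "$fs_with_extents" "$old_kernel" || echo "old kernel: refuses"
```

The point of a filesystem-wide bit is that the refusal happens once, at
mount time, rather than as per-inode errors later on.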
In an emergency it would also be possible to remove the INCOMPAT_EXTENTS
filesystem flag and access all of the non-extent files, but this would
risk filesystem corruption if any of the extent files were modified or
unlinked, as that is the only indication older kernels have of this change.
So, to answer your question, if you _really_ want to get rid of extents
on a filesystem, you mount the filesystem with INCOMPAT_EXTENTS on a new
kernel that supports extents, but without -o extents so new files will
use the old block-map layout, so if "orig-file" is an extent-mapped file:
cp /mnt/tmp/orig-file /mnt/tmp/temp-block-mapped-file
mv /mnt/tmp/temp-block-mapped-file /mnt/tmp/orig-file
and now /mnt/tmp/orig-file is no longer extent-mapped. Do this for all
the extent-mapped files, unmount, use "debugfs -w -R 'feature ^extents' {dev}"
and your filesystem is mountable with any old kernel.
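
The cp/mv step above could be wrapped in a small helper when many files
need converting. This is a rough sketch, not a tested migration tool: the
function names and the temporary-file suffix are made up for the example,
and it assumes the filesystem is mounted on an extents-capable kernel
without "-o extents", so that each copy is created block-mapped.

```shell
#!/bin/sh
# Rewrite one file by copying it and renaming the copy over the original.
# The copy is a new inode, so it gets whatever layout new files get on
# this mount (block-mapped, if "-o extents" is not in effect).
rewrite_block_mapped() {
    orig=$1
    tmp="$orig.blockmap.$$"     # hypothetical temp name in the same directory
    # -p keeps mode, ownership and timestamps where permitted
    cp -p -- "$orig" "$tmp" || { rm -f -- "$tmp"; return 1; }
    mv -f -- "$tmp" "$orig"
}

# Read one path per line on stdin and rewrite each file in turn.
rewrite_all() {
    while IFS= read -r f; do
        rewrite_block_mapped "$f" || echo "failed: $f" >&2
    done
}
```

Feed rewrite_all the list of extent-mapped files, one path per line.
Note that copy-and-rename breaks hard links and does nothing for files
that running processes still hold open, so it is best done on a quiet
filesystem.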
No, it's not quite as easy as ext3 journal recovery->ext2 mounting,
but then again "-o extents" isn't something that happens automatically
(at least not for a couple of years), and hopefully distros will be smart
enough never to do this for filesystems like /boot or / that are critical
for mounting on a wide variety of kernels. Besides which, we don't want
to have to teach GRUB about extent-mapped files. Conceivably, if this
becomes an issue then it should be possible to add a "no extents" flag
to inodes and parent directories that is inherited by new files that
should never be extent mapped.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 2:09 ` [Ext2-devel] " Andreas Dilger
@ 2006-06-10 2:45 ` Nicholas Miell
2006-06-10 4:29 ` Andreas Dilger
0 siblings, 1 reply; 295+ messages in thread
From: Nicholas Miell @ 2006-06-10 2:45 UTC (permalink / raw)
To: Andreas Dilger
Cc: Valdis.Kletnieks, Alex Tomas, Jeff Garzik, ext2-devel,
linux-kernel, Christoph Hellwig, Mingming Cao, linux-fsdevel
On Fri, 2006-06-09 at 20:09 -0600, Andreas Dilger wrote:
> On Jun 09, 2006 21:21 -0400, Valdis.Kletnieks@vt.edu wrote:
> > On Fri, 09 Jun 2006 17:21:08 MDT, Andreas Dilger said:
> > > You mount with the new kernel without "-o extents", and find files with
> > > extents "lsattr -R /mnt/tmp | awk '/----e / { print $2 }'", copy those
> > > files, mv over old files, unmount.
> >
> > How do you "copy those files" when you don't have extent support at that
> > point? Remember - the whole problem here is that if you don't have
> > extent support, you can't read the file, it's backward-incompatible.
> > (If you *are* able to read the file even without extents, then this whole
> > thread is total BS).
>
> The "-o extents" mount option only affects new files that are created
> while that option is enabled. It doesn't affect existing files (even if
> they are modified while "-o extents" is set). It also doesn't affect any
> new files after "-o extents" is removed. Also, directories will not
> be extent-mapped, because their allocation pattern doesn't mix well with
> extent-mapped files (i.e. they are mostly single-block allocations).
>
> Files that are created with "-o extents" are of course only readable with
> a kernel that supports it. To be safe, the whole filesystem is marked
> with an EXT3_FEATURE_INCOMPAT_EXTENTS flag when the first extent file
> is created so that users don't inadvertently get strange errors while
> accessing the inodes marked with EXT3_EXTENT_FL with an old kernel.
> New kernels that understand INCOMPAT_EXTENTS of course can access extent
> and non-extent files equally well.
>
> In an emergency it would also be possible to remove the INCOMPAT_EXTENTS
> filesystem flag and access all of the non-extent files, but this would
> risk filesystem corruption if any of the extent files were modified or
> unlinked, as that is the only indication older kernels have of this change.
>
> So, to answer your question, if you _really_ want to get rid of extents
> on a filesystem, you mount the filesystem with INCOMPAT_EXTENTS on a new
> kernel that supports extents, but without -o extents so new files will
> use the old block-map layout, so if "orig-file" is an extent-mapped file:
>
> cp /mnt/tmp/orig-file /mnt/tmp/temp-block-mapped-file
> mv /mnt/tmp/temp-block-mapped-file /mnt/tmp/orig-file
>
> and now /mnt/tmp/orig-file is no longer extent-mapped. Do this for all
> the extent-mapped files, unmount, use "debugfs -w -R 'feature ^extents' {dev}"
> and your filesystem is mountable with any old kernel.
>
> No, it's not quite as easy as ext3 journal recovery->ext2 mounting,
> but then again "-o extents" isn't something that happens automatically
> (at least not for a couple of years), and hopefully distros will be smart
> enough never to do this for filesystems like /boot or / that are critical
> for mounting on a wide variety of kernels. Besides which, we don't want
> to have to teach GRUB about extent-mapped files. Conceivably, if this
> becomes an issue then it should be possible to add a "no extents" flag
> to inodes and parent directories that is inherited by new files that
> should never be extent mapped.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
I think changing all of this mess to:
[root@localhost root]# tune2fs -O extents /dev/whatever
WARNING: Enabling extents on /dev/whatever will make this filesystem
unreadable in Linux kernel versions before 2.6.19!
Are you sure you want to do this? <y/n>
[root@localhost root]# tune2fs -O ^extents /dev/whatever
WARNING: Disabling extents on /dev/whatever requires you to run e2fsck
on this filesystem before it can be used again!
Are you sure you want to do this? <y/n>
might assuage many of the fears presented in this thread.
--
Nicholas Miell <nmiell@comcast.net>
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-10 2:45 ` Nicholas Miell
@ 2006-06-10 4:29 ` Andreas Dilger
0 siblings, 0 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-10 4:29 UTC (permalink / raw)
To: Nicholas Miell
Cc: Valdis.Kletnieks, Alex Tomas, Jeff Garzik, ext2-devel,
linux-kernel, Christoph Hellwig, Mingming Cao, linux-fsdevel
On Jun 09, 2006 19:45 -0700, Nicholas Miell wrote:
> I think changing all of this mess to:
>
> [root@localhost root]# tune2fs -O extents /dev/whatever
> WARNING: Enabling extents on /dev/whatever will make this filesystem
> unreadable in Linux kernel versions before 2.6.19!
> Are you sure you want to do this? <y/n>
>
> [root@localhost root]# tune2fs -O ^extents /dev/whatever
> WARNING: Disabling extents on /dev/whatever requires you to run e2fsck
> on this filesystem before it can be used again!
> Are you sure you want to do this? <y/n>
>
> might assuage many of the fears presented in this thread.
If that were true, then I'd be happy to make this the barrier to entry.
Sadly, I don't think that is the only issue, but I'm happy to be shown
to be wrong.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 15:17 ` [Ext2-devel] " Jeff Garzik
2006-06-09 16:21 ` Mike Snitzer
@ 2006-06-09 16:56 ` Andreas Dilger
2006-06-09 17:32 ` [Ext2-devel] " Greg KH
2006-06-09 18:48 ` Jeff Garzik
1 sibling, 2 replies; 295+ messages in thread
From: Andreas Dilger @ 2006-06-09 16:56 UTC (permalink / raw)
To: Jeff Garzik
Cc: ext2-devel, linux-kernel, Christoph Hellwig, Mingming Cao,
linux-fsdevel, Alex Tomas
On Jun 09, 2006 11:17 -0400, Jeff Garzik wrote:
> Not all users are big production houses that plan their filesystem
> metadata migration months in advance! I _guarantee_ some users will
> boot into ext3-with-extents, use it for a while, and then try to
> downgrade for whatever reason... only to find they have been LOCKED
> OUT. That is a very real world situation, guys.
Except that the only way that they will get extents is if they read some
documentation that tells them to mount with "-o extents", which will also
say "this is incompatible with older kernels - only use it if you aren't
going to revert to older kernels". If they try to mount such a filesystem
it will report "trying to mount filesystem with incompatible feature",
and "e2fsprogs" will report "incompatible feature extents - please upgrade
your e2fsprogs" (for versions newer than Nov 2004).
It's a lot better than e.g. the latest ubuntu which (apparently,
I read) can't mount a kernel older than 2.6.15 because of udev (or
sysfs?) changes. It's better than e.g. reiserfs vs. reiser4 compatibility
(which doesn't exist). 2.4 kernels probably can't mount a new udev root
filesystem because none of the /dev files exist either. 2.4 kernels can't
mount a filesystem that is using device mapper ("LVM 2.0") instead of
"LVM 1.0". All 2.2 kernel.org kernels couldn't use any system with RAID,
because any distro worth its salt had upgraded the RAID code to a working
(incompatible) version.
Nobody is forcing users to use extents. Same with large inodes in ext3,
which give a 7x speedup in samba4 performance - did this cause you any
heartburn yet? Large inodes + fast EAs are available for people who want
to use it for a couple of years already, will soon allow nanosecond times
and maybe one day in the distant future it will become the default but not
yet. In a few years, the support for extents in ext3 will be pervasive
and most people won't care if they can boot to 2.4.10 or not, and if they
care about this they will also know enough not to enable extents. The ext3
developers are a very cautious bunch, and don't force anything onto users.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [Ext2-devel] [RFC 0/13] extents and 48bit ext3
2006-06-09 16:56 ` Andreas Dilger
@ 2006-06-09 17:32 ` Greg KH
2006-06-09 18:48 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Greg KH @ 2006-06-09 17:32 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Christoph Hellwig, linux-fsdevel,
ext2-devel, Mingming Cao, linux-kernel
On Fri, Jun 09, 2006 at 10:56:43AM -0600, Andreas Dilger wrote:
> It's a lot better than e.g. the latest ubuntu which (apparently,
> I read) can't mount a kernel older than 2.6.15 because of udev (or
> sysfs?) changes.
If this is true, then it's only because the Ubuntu developers do not
want to support older kernel versions. Other distros handle this just
fine (Gentoo and Debian for example). This is not a kernel issue, but
rather a distro design issue.
Which is much different from the fact that I take a "ext3" partition
from my new distro and can't get to the data if I downgrade to an older
distro for whatever reason (or use an older rescue disk.)
Don't confuse distro design decisions with issues forced on an unknowing
user by the ext3 fs kernel developers.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 295+ messages in thread
* Re: [RFC 0/13] extents and 48bit ext3
2006-06-09 16:56 ` Andreas Dilger
2006-06-09 17:32 ` [Ext2-devel] " Greg KH
@ 2006-06-09 18:48 ` Jeff Garzik
1 sibling, 0 replies; 295+ messages in thread
From: Jeff Garzik @ 2006-06-09 18:48 UTC (permalink / raw)
To: Jeff Garzik, Alex Tomas, Christoph Hellwig, linux-fsdevel,
ext2-devel, Mingming Cao, linux-kernel
Andreas Dilger wrote:
> Except that the only way that they will get extents is if they read some
> documentation that tells them to mount with "-o extents", which will also
> say "this is incompatible with older kernels - only use it if you aren't
> going to revert to older kernels". If they try to mount such a filesystem
> it will report "trying to mount filesystem with incompatible feature",
> and "e2fsprogs" will report "incompatible feature extents - please upgrade
> your e2fsprogs" (for versions newer than Nov 2004).
False. What will happen is that distros will default to extents, and
users will continue to not read documentation, as usual.
> It's a lot better than e.g. the latest ubuntu which (apparently,
> I read) can't mount a kernel older than 2.6.15 because of udev (or
> sysfs?) changes. It's better than e.g. reiserfs vs. reiser4 compatibility
> (which doesn't exist). 2.4 kernels probably can't mount a new udev root
> filesystem because none of the /dev files exist either. 2.4 kernels can't
> mount a filesystem that is using device mapper ("LVM 2.0") instead of
> "LVM 1.0". All 2.2 kernel.org kernels couldn't use any system with RAID,
> because any distro worth its salt had upgraded the RAID code to a working
> (incompatible) version.
This is different.
The proposal is to change the thing called "ext3" to suddenly require
kernels >= 2.6.18, while still calling it "ext3."
The above examples are actually proving my point. The above examples
had much more clear distinctions between incompatible upgrades.
> Nobody is forcing users to use extents. Same with large inodes in ext3,
> which give a 7x speedup in samba4 performance - did this cause you any
> heartburn yet? Large inodes + fast EAs are available for people who want
> to use it for a couple of years already, will soon allow nanosecond times
> and maybe one day in the distant future it will become the default but not
> yet. In a few years, the support for extents in ext3 will be pervasive
> and most people won't care if they can boot to 2.4.10 or not, and if they
> care about this they will also know enough not to enable extents. The ext3
> developers are a very cautious bunch, and don't force anything onto users.
I wouldn't use the word "cautious" to describe continually adding new,
incompatible features to the main Linux filesystem.
You are as cautious as one can be, while adding potentially
destabilizing features.
Jeff
^ permalink raw reply [flat|nested] 295+ messages in thread
* [RFC][Update 0/16]extents and 48bit ext3/4 patches
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (2 preceding siblings ...)
2006-06-09 9:13 ` Christoph Hellwig
@ 2006-06-30 0:16 ` Mingming Cao
2006-06-30 0:16 ` [RFC][Update][Patch 1/16]core extent map support Mingming Cao
` (15 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:16 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Hello!
Here are the updated ext3/4 patches to support a 48bit ext4 filesystem.
http://ext2.sourceforge.net/48bitext3/patches/latest/
Changes since the last post include:
- bug fixes in the 64bit JBD changes, which broke non-extents 32 bit ext3
filesystems
- sync up the on-disk super block structure with e2fsprogs
- add initial handling of uninitialized extents
- removed the 32 bit ext3 bug fix patches from this series, as they have
been folded into the current Linus git tree.
Patches are against 2.6.17-git13, tested on both 32 bit and 64 bit arch.
They survived the fsx test on 32bit ext3 (mounted w/o extents, with
CONFIG_LBD enabled) and on 48 bit ext4 (mounted with extents).
Any comments and feedback are appreciated.
Thanks,
Mingming
-------------------------------------
patch series:
#------------------------
# base extent support (32bit)
#------------------------
ext3-extents.patch
#------------------------
# 48bit ext3 patches
#-------------------------
# 64 bit in-kernel block number support
#
#sector_t type format string for all arch
sector_fmt.patch
#support >32 bit fs block type in kernel (convert ext3_fsblk_t to sector_t)
ext3_fsblk_sector_t.patch
#
#48 bit extent map patches
#
ext3-extents-48bit.patch
ext3-extents-ext3_fsblk_t.patch
ext3-unitialized-extent-handling.patch
#
# 64bit JBD support
#
64bit_jbd_core.patch
jbd-avoid-blk-overflow-write-journal-metadata-tag.patch
jbd-read-32bit-tag-fix.patch
jbd-cleanup-journal_tag_bytes.patch
sector_t-jbd.patch
jbd-revoke-32bit-shift-fix.patch
#
# 48 bit on-disk xattr support
#
ext3_48bit_i_file_acl.patch
#
# 64bit on-disk sb metadata changes
#
64bit-metadata.patch
64bit-incompat-flag-change.patch
ext3-sb-struc-sync-with-e2fsprog.patch
^ permalink raw reply [flat|nested] 295+ messages in thread
* [RFC][Update][Patch 1/16]core extent map support
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (3 preceding siblings ...)
2006-06-30 0:16 ` [RFC][Update 0/16]extents and 48bit ext3/4 patches Mingming Cao
@ 2006-06-30 0:16 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 2/16]sector_t type format string Mingming Cao
` (14 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:16 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Add extent map support to ext3. Patch from Alex Tomas.
On disk extents format:
/*
* this is extent on-disk structure
* it's used at the bottom of the tree
*/
struct ext3_extent {
__le32 ee_block; /* first logical block extent covers */
__le16 ee_len; /* number of blocks covered by extent */
__le16 ee_start_hi; /* high 16 bits of physical block */
__le32 ee_start; /* low 32 bits of physical block */
};
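As a rough illustration of how the two physical-block fields above could be combined into a 48-bit block number, here is a minimal user-space sketch. It is not taken from the patch: the struct extent_sketch type and the extent_pblock()/extent_store_pblock() helpers are hypothetical names, and the fields are kept in host byte order for simplicity, whereas the on-disk fields are little-endian and need le16_to_cpu()/le32_to_cpu() style conversion in the kernel. Note that the code in this patch (1/16) still reads only ee_start; the 48-bit handling comes in later patches of the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical host-endian stand-in for the on-disk struct ext3_extent.
 * The real structure uses __le16/__le32 fields.
 */
struct extent_sketch {
	uint32_t ee_block;    /* first logical block extent covers */
	uint16_t ee_len;      /* number of blocks covered by extent */
	uint16_t ee_start_hi; /* high 16 bits of physical block */
	uint32_t ee_start;    /* low 32 bits of physical block */
};

/* Assemble the 48-bit physical block number from its two halves. */
static uint64_t extent_pblock(const struct extent_sketch *ex)
{
	return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/* Split a physical block number (must fit in 48 bits) into the two fields. */
static void extent_store_pblock(struct extent_sketch *ex, uint64_t pblock)
{
	assert(pblock < (1ULL << 48));
	ex->ee_start = (uint32_t)(pblock & 0xffffffffULL);
	ex->ee_start_hi = (uint16_t)(pblock >> 32);
}
```

With 4k blocks, 48 bits of block number address 2^48 * 2^12 = 2^60 bytes, which matches the 1024 PB limit quoted in the original cover letter.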
Signed-off-by: Alex Tomas <alex@clusterfs.com>
---
linux-2.6.17-ming/fs/ext3/Makefile | 2
linux-2.6.17-ming/fs/ext3/dir.c | 3
linux-2.6.17-ming/fs/ext3/extents.c | 2069 ++++++++++++++++++++++
linux-2.6.17-ming/fs/ext3/ialloc.c | 11
linux-2.6.17-ming/fs/ext3/inode.c | 17
linux-2.6.17-ming/fs/ext3/ioctl.c | 1
linux-2.6.17-ming/fs/ext3/super.c | 10
linux-2.6.17-ming/include/linux/ext3_fs.h | 31
linux-2.6.17-ming/include/linux/ext3_fs_extents.h | 196 ++
linux-2.6.17-ming/include/linux/ext3_fs_i.h | 13
linux-2.6.17-ming/include/linux/ext3_fs_sb.h | 10
linux-2.6.17-ming/include/linux/ext3_jbd.h | 17
12 files changed, 2361 insertions(+), 19 deletions(-)
diff -puN fs/ext3/dir.c~ext3-extents fs/ext3/dir.c
--- linux-2.6.17/fs/ext3/dir.c~ext3-extents 2006-06-28 13:25:19.629991124 -0700
+++ linux-2.6.17-ming/fs/ext3/dir.c 2006-06-28 13:25:19.674985962 -0700
@@ -131,8 +131,7 @@ static int ext3_readdir(struct file * fi
struct buffer_head *bh = NULL;
map_bh.b_state = 0;
- err = ext3_get_blocks_handle(NULL, inode, blk, 1,
- &map_bh, 0, 0);
+ err = ext3_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0);
if (err > 0) {
page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra,
diff -puN /dev/null fs/ext3/extents.c
--- /dev/null 2006-06-28 00:02:13.345547960 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c 2006-06-28 13:39:22.744250572 -0700
@@ -0,0 +1,2069 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
+ * Written by Alex Tomas <alex@clusterfs.com>
+ *
+ * Architecture independence:
+ * Copyright (c) 2005, Bull S.A.
+ * Written by Pierre Peiffer <pierre.peiffer@bull.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public Licens
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
+ */
+
+/*
+ * Extents support for EXT3
+ *
+ * TODO:
+ * - ext3*_error() should be used in some situations
+ * - analyze all BUG()/BUG_ON(), use -EIO where appropriate
+ * - smart tree reduction
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/time.h>
+#include <linux/ext3_jbd.h>
+#include <linux/jbd.h>
+#include <linux/smp_lock.h>
+#include <linux/highuid.h>
+#include <linux/pagemap.h>
+#include <linux/quotaops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/ext3_fs_extents.h>
+#include <asm/uaccess.h>
+
+
+static int ext3_ext_check_header(const char *function, struct inode *inode,
+ struct ext3_extent_header *eh)
+{
+ const char *error_msg = NULL;
+
+ if (unlikely(eh->eh_magic != EXT3_EXT_MAGIC)) {
+ error_msg = "invalid magic";
+ goto corrupted;
+ }
+ if (unlikely(eh->eh_max == 0)) {
+ error_msg = "invalid eh_max";
+ goto corrupted;
+ }
+ if (unlikely(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max))) {
+ error_msg = "invalid eh_entries";
+ goto corrupted;
+ }
+ return 0;
+
+corrupted:
+ ext3_error(inode->i_sb, function,
+ "bad header in inode #%lu: %s - magic %x, "
+ "entries %u, max %u, depth %u",
+ inode->i_ino, error_msg, le16_to_cpu(eh->eh_magic),
+ le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
+ le16_to_cpu(eh->eh_depth));
+
+ return -EIO;
+}
+
+static handle_t *ext3_ext_journal_restart(handle_t *handle, int needed)
+{
+ int err;
+
+ if (handle->h_buffer_credits > needed)
+ return handle;
+ if (!ext3_journal_extend(handle, needed))
+ return handle;
+ err = ext3_journal_restart(handle, needed);
+
+ return handle;
+}
+
+/*
+ * could return:
+ * - EROFS
+ * - ENOMEM
+ */
+static int ext3_ext_get_access(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ if (path->p_bh) {
+ /* path points to block */
+ return ext3_journal_get_write_access(handle, path->p_bh);
+ }
+ /* path points to leaf/index in inode body */
+ /* we use in-core data, no need to protect them */
+ return 0;
+}
+
+/*
+ * could return:
+ * - EROFS
+ * - ENOMEM
+ * - EIO
+ */
+static int ext3_ext_dirty(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ int err;
+ if (path->p_bh) {
+ /* path points to block */
+ err = ext3_journal_dirty_metadata(handle, path->p_bh);
+ } else {
+ /* path points to leaf/index in inode body */
+ err = ext3_mark_inode_dirty(handle, inode);
+ }
+ return err;
+}
+
+static int ext3_ext_find_goal(struct inode *inode,
+ struct ext3_ext_path *path,
+ unsigned long block)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ unsigned long bg_start;
+ unsigned long colour;
+ int depth;
+
+ if (path) {
+ struct ext3_extent *ex;
+ depth = path->p_depth;
+
+ /* try to predict block placement */
+ if ((ex = path[depth].p_ext))
+ return le32_to_cpu(ex->ee_start)
+ + (block - le32_to_cpu(ex->ee_block));
+
+ /* it looks index is empty
+ * try to find starting from index itself */
+ if (path[depth].p_bh)
+ return path[depth].p_bh->b_blocknr;
+ }
+
+ /* OK. use inode's group */
+ bg_start = (ei->i_block_group * EXT3_BLOCKS_PER_GROUP(inode->i_sb)) +
+ le32_to_cpu(EXT3_SB(inode->i_sb)->s_es->s_first_data_block);
+ colour = (current->pid % 16) *
+ (EXT3_BLOCKS_PER_GROUP(inode->i_sb) / 16);
+ return bg_start + colour + block;
+}
+
+static int
+ext3_ext_new_block(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path,
+ struct ext3_extent *ex, int *err)
+{
+ int goal, newblock;
+
+ goal = ext3_ext_find_goal(inode, path, le32_to_cpu(ex->ee_block));
+ newblock = ext3_new_block(handle, inode, goal, err);
+ return newblock;
+}
+
+static inline int ext3_ext_space_block(struct inode *inode)
+{
+ int size;
+
+ size = (inode->i_sb->s_blocksize - sizeof(struct ext3_extent_header))
+ / sizeof(struct ext3_extent);
+#ifdef AGRESSIVE_TEST
+ if (size > 6)
+ size = 6;
+#endif
+ return size;
+}
+
+static inline int ext3_ext_space_block_idx(struct inode *inode)
+{
+ int size;
+
+ size = (inode->i_sb->s_blocksize - sizeof(struct ext3_extent_header))
+ / sizeof(struct ext3_extent_idx);
+#ifdef AGRESSIVE_TEST
+ if (size > 5)
+ size = 5;
+#endif
+ return size;
+}
+
+static inline int ext3_ext_space_root(struct inode *inode)
+{
+ int size;
+
+ size = sizeof(EXT3_I(inode)->i_data);
+ size -= sizeof(struct ext3_extent_header);
+ size /= sizeof(struct ext3_extent);
+#ifdef AGRESSIVE_TEST
+ if (size > 3)
+ size = 3;
+#endif
+ return size;
+}
+
+static inline int ext3_ext_space_root_idx(struct inode *inode)
+{
+ int size;
+
+ size = sizeof(EXT3_I(inode)->i_data);
+ size -= sizeof(struct ext3_extent_header);
+ size /= sizeof(struct ext3_extent_idx);
+#ifdef AGRESSIVE_TEST
+ if (size > 4)
+ size = 4;
+#endif
+ return size;
+}
+
+#ifdef EXT_DEBUG
+static void ext3_ext_show_path(struct inode *inode, struct ext3_ext_path *path)
+{
+ int k, l = path->p_depth;
+
+ ext_debug("path:");
+ for (k = 0; k <= l; k++, path++) {
+ if (path->p_idx) {
+ ext_debug(" %d->%d", le32_to_cpu(path->p_idx->ei_block),
+ le32_to_cpu(path->p_idx->ei_leaf));
+ } else if (path->p_ext) {
+ ext_debug(" %d:%d:%d",
+ le32_to_cpu(path->p_ext->ee_block),
+ le16_to_cpu(path->p_ext->ee_len),
+ le32_to_cpu(path->p_ext->ee_start));
+ } else
+ ext_debug(" []");
+ }
+ ext_debug("\n");
+}
+
+static void ext3_ext_show_leaf(struct inode *inode, struct ext3_ext_path *path)
+{
+ int depth = ext_depth(inode);
+ struct ext3_extent_header *eh;
+ struct ext3_extent *ex;
+ int i;
+
+ if (!path)
+ return;
+
+ eh = path[depth].p_hdr;
+ ex = EXT_FIRST_EXTENT(eh);
+
+ for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
+ ext_debug("%d:%d:%d ", le32_to_cpu(ex->ee_block),
+ le16_to_cpu(ex->ee_len),
+ le32_to_cpu(ex->ee_start));
+ }
+ ext_debug("\n");
+}
+#else
+#define ext3_ext_show_path(inode,path)
+#define ext3_ext_show_leaf(inode,path)
+#endif
+
+static void ext3_ext_drop_refs(struct ext3_ext_path *path)
+{
+ int depth = path->p_depth;
+ int i;
+
+ for (i = 0; i <= depth; i++, path++)
+ if (path->p_bh) {
+ brelse(path->p_bh);
+ path->p_bh = NULL;
+ }
+}
+
+/*
+ * binary search for closest index by given block
+ */
+static void
+ext3_ext_binsearch_idx(struct inode *inode, struct ext3_ext_path *path, int block)
+{
+ struct ext3_extent_header *eh = path->p_hdr;
+ struct ext3_extent_idx *r, *l, *m;
+
+ BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+ BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+ BUG_ON(le16_to_cpu(eh->eh_entries) <= 0);
+
+ ext_debug("binsearch for %d(idx): ", block);
+
+ l = EXT_FIRST_INDEX(eh) + 1;
+ r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh->eh_entries) - 1;
+ while (l <= r) {
+ m = l + (r - l) / 2;
+ if (block < le32_to_cpu(m->ei_block))
+ r = m - 1;
+ else
+ l = m + 1;
+ ext_debug("%p(%u):%p(%u):%p(%u) ", l, l->ei_block,
+ m, m->ei_block, r, r->ei_block);
+ }
+
+ path->p_idx = l - 1;
+ ext_debug(" -> %d->%d ", le32_to_cpu(path->p_idx->ei_block),
+ le32_to_cpu(path->p_idx->ei_leaf));
+
+#ifdef CHECK_BINSEARCH
+ {
+ struct ext3_extent_idx *chix, *ix;
+ int k;
+
+ chix = ix = EXT_FIRST_INDEX(eh);
+ for (k = 0; k < le16_to_cpu(eh->eh_entries); k++, ix++) {
+ if (k != 0 &&
+ le32_to_cpu(ix->ei_block) <= le32_to_cpu(ix[-1].ei_block)) {
+ printk("k=%d, ix=0x%p, first=0x%p\n", k,
+ ix, EXT_FIRST_INDEX(eh));
+ printk("%u <= %u\n",
+ le32_to_cpu(ix->ei_block),
+ le32_to_cpu(ix[-1].ei_block));
+ }
+ BUG_ON(k && le32_to_cpu(ix->ei_block)
+ <= le32_to_cpu(ix[-1].ei_block));
+ if (block < le32_to_cpu(ix->ei_block))
+ break;
+ chix = ix;
+ }
+ BUG_ON(chix != path->p_idx);
+ }
+#endif
+
+}
+
+/*
+ * binary search for closest extent by given block
+ */
+static void
+ext3_ext_binsearch(struct inode *inode, struct ext3_ext_path *path, int block)
+{
+ struct ext3_extent_header *eh = path->p_hdr;
+ struct ext3_extent *r, *l, *m;
+
+ BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+ BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+
+ if (eh->eh_entries == 0) {
+ /*
+ * this leaf is empty yet:
+ * we get such a leaf in split/add case
+ */
+ return;
+ }
+
+ ext_debug("binsearch for %d: ", block);
+
+ l = EXT_FIRST_EXTENT(eh) + 1;
+ r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh->eh_entries) - 1;
+
+ while (l <= r) {
+ m = l + (r - l) / 2;
+ if (block < le32_to_cpu(m->ee_block))
+ r = m - 1;
+ else
+ l = m + 1;
+ ext_debug("%p(%u):%p(%u):%p(%u) ", l, l->ee_block,
+ m, m->ee_block, r, r->ee_block);
+ }
+
+ path->p_ext = l - 1;
+ ext_debug(" -> %d:%d:%d ",
+ le32_to_cpu(path->p_ext->ee_block),
+ le32_to_cpu(path->p_ext->ee_start),
+ le16_to_cpu(path->p_ext->ee_len));
+
+#ifdef CHECK_BINSEARCH
+ {
+ struct ext3_extent *chex, *ex;
+ int k;
+
+ chex = ex = EXT_FIRST_EXTENT(eh);
+ for (k = 0; k < le16_to_cpu(eh->eh_entries); k++, ex++) {
+ BUG_ON(k && le32_to_cpu(ex->ee_block)
+ <= le32_to_cpu(ex[-1].ee_block));
+ if (block < le32_to_cpu(ex->ee_block))
+ break;
+ chex = ex;
+ }
+ BUG_ON(chex != path->p_ext);
+ }
+#endif
+
+}
+
+int ext3_ext_tree_init(handle_t *handle, struct inode *inode)
+{
+ struct ext3_extent_header *eh;
+
+ eh = ext_inode_hdr(inode);
+ eh->eh_depth = 0;
+ eh->eh_entries = 0;
+ eh->eh_magic = EXT3_EXT_MAGIC;
+ eh->eh_max = cpu_to_le16(ext3_ext_space_root(inode));
+ ext3_mark_inode_dirty(handle, inode);
+ ext3_ext_invalidate_cache(inode);
+ return 0;
+}
+
+struct ext3_ext_path *
+ext3_ext_find_extent(struct inode *inode, int block, struct ext3_ext_path *path)
+{
+ struct ext3_extent_header *eh;
+ struct buffer_head *bh;
+ short int depth, i, ppos = 0, alloc = 0;
+
+ eh = ext_inode_hdr(inode);
+ BUG_ON(eh == NULL);
+ if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+ return ERR_PTR(-EIO);
+
+ i = depth = ext_depth(inode);
+
+ /* account possible depth increase */
+ if (!path) {
+ path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 2),
+ GFP_NOFS);
+ if (!path)
+ return ERR_PTR(-ENOMEM);
+ alloc = 1;
+ }
+ memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1));
+ path[0].p_hdr = eh;
+
+ /* walk through the tree */
+ while (i) {
+ ext_debug("depth %d: num %d, max %d\n",
+ ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
+ ext3_ext_binsearch_idx(inode, path + ppos, block);
+ path[ppos].p_block = le32_to_cpu(path[ppos].p_idx->ei_leaf);
+ path[ppos].p_depth = i;
+ path[ppos].p_ext = NULL;
+
+ bh = sb_bread(inode->i_sb, path[ppos].p_block);
+ if (!bh)
+ goto err;
+
+ eh = ext_block_hdr(bh);
+ ppos++;
+ BUG_ON(ppos > depth);
+ path[ppos].p_bh = bh;
+ path[ppos].p_hdr = eh;
+ i--;
+
+ if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+ goto err;
+ }
+
+ path[ppos].p_depth = i;
+ path[ppos].p_hdr = eh;
+ path[ppos].p_ext = NULL;
+ path[ppos].p_idx = NULL;
+
+ if (ext3_ext_check_header(__FUNCTION__, inode, eh))
+ goto err;
+
+ /* find extent */
+ ext3_ext_binsearch(inode, path + ppos, block);
+
+ ext3_ext_show_path(inode, path);
+
+ return path;
+
+err:
+ ext3_ext_drop_refs(path);
+ if (alloc)
+ kfree(path);
+ return ERR_PTR(-EIO);
+}
+
+/*
+ * insert new index [logical;ptr] into the block at curp
+ * it checks where to insert: before curp or after curp
+ */
+static int ext3_ext_insert_index(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *curp,
+ int logical, int ptr)
+{
+ struct ext3_extent_idx *ix;
+ int len, err;
+
+ if ((err = ext3_ext_get_access(handle, inode, curp)))
+ return err;
+
+ BUG_ON(logical == le32_to_cpu(curp->p_idx->ei_block));
+ len = EXT_MAX_INDEX(curp->p_hdr) - curp->p_idx;
+ if (logical > le32_to_cpu(curp->p_idx->ei_block)) {
+ /* insert after */
+ if (curp->p_idx != EXT_LAST_INDEX(curp->p_hdr)) {
+ len = (len - 1) * sizeof(struct ext3_extent_idx);
+ len = len < 0 ? 0 : len;
+ ext_debug("insert new index %d after: %d. "
+ "move %d from 0x%p to 0x%p\n",
+ logical, ptr, len,
+ (curp->p_idx + 1), (curp->p_idx + 2));
+ memmove(curp->p_idx + 2, curp->p_idx + 1, len);
+ }
+ ix = curp->p_idx + 1;
+ } else {
+ /* insert before */
+ len = len * sizeof(struct ext3_extent_idx);
+ len = len < 0 ? 0 : len;
+ ext_debug("insert new index %d before: %d. "
+ "move %d from 0x%p to 0x%p\n",
+ logical, ptr, len,
+ curp->p_idx, (curp->p_idx + 1));
+ memmove(curp->p_idx + 1, curp->p_idx, len);
+ ix = curp->p_idx;
+ }
+
+ ix->ei_block = cpu_to_le32(logical);
+ ix->ei_leaf = cpu_to_le32(ptr);
+ curp->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(curp->p_hdr->eh_entries)+1);
+
+ BUG_ON(le16_to_cpu(curp->p_hdr->eh_entries)
+ > le16_to_cpu(curp->p_hdr->eh_max));
+ BUG_ON(ix > EXT_LAST_INDEX(curp->p_hdr));
+
+ err = ext3_ext_dirty(handle, inode, curp);
+ ext3_std_error(inode->i_sb, err);
+
+ return err;
+}
+
+/*
+ * routine inserts new subtree into the path, using free index entry
+ * at depth 'at:
+ * - allocates all needed blocks (new leaf and all intermediate index blocks)
+ * - makes decision where to split
+ * - moves remaining extens and index entries (right to the split point)
+ * into the newly allocated blocks
+ * - initialize subtree
+ */
+static int ext3_ext_split(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path,
+ struct ext3_extent *newext, int at)
+{
+ struct buffer_head *bh = NULL;
+ int depth = ext_depth(inode);
+ struct ext3_extent_header *neh;
+ struct ext3_extent_idx *fidx;
+ struct ext3_extent *ex;
+ int i = at, k, m, a;
+ unsigned long newblock, oldblock;
+ __le32 border;
+ int *ablocks = NULL; /* array of allocated blocks */
+ int err = 0;
+
+ /* make decision: where to split? */
+ /* FIXME: now desicion is simplest: at current extent */
+
+ /* if current leaf will be splitted, then we should use
+ * border from split point */
+ BUG_ON(path[depth].p_ext > EXT_MAX_EXTENT(path[depth].p_hdr));
+ if (path[depth].p_ext != EXT_MAX_EXTENT(path[depth].p_hdr)) {
+ border = path[depth].p_ext[1].ee_block;
+ ext_debug("leaf will be splitted."
+ " next leaf starts at %d\n",
+ le32_to_cpu(border));
+ } else {
+ border = newext->ee_block;
+ ext_debug("leaf will be added."
+ " next leaf starts at %d\n",
+ le32_to_cpu(border));
+ }
+
+ /*
+ * if error occurs, then we break processing
+ * and turn filesystem read-only. so, index won't
+ * be inserted and tree will be in consistent
+ * state. next mount will repair buffers too
+ */
+
+ /*
+ * get array to track all allocated blocks
+ * we need this to handle errors and free blocks
+ * upon them
+ */
+ ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS);
+ if (!ablocks)
+ return -ENOMEM;
+ memset(ablocks, 0, sizeof(unsigned long) * depth);
+
+ /* allocate all needed blocks */
+ ext_debug("allocate %d blocks for indexes/leaf\n", depth - at);
+ for (a = 0; a < depth - at; a++) {
+ newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
+ if (newblock == 0)
+ goto cleanup;
+ ablocks[a] = newblock;
+ }
+
+ /* initialize new leaf */
+ newblock = ablocks[--a];
+ BUG_ON(newblock == 0);
+ bh = sb_getblk(inode->i_sb, newblock);
+ if (!bh) {
+ err = -EIO;
+ goto cleanup;
+ }
+ lock_buffer(bh);
+
+ if ((err = ext3_journal_get_create_access(handle, bh)))
+ goto cleanup;
+
+ neh = ext_block_hdr(bh);
+ neh->eh_entries = 0;
+ neh->eh_max = cpu_to_le16(ext3_ext_space_block(inode));
+ neh->eh_magic = EXT3_EXT_MAGIC;
+ neh->eh_depth = 0;
+ ex = EXT_FIRST_EXTENT(neh);
+
+ /* move remain of path[depth] to the new leaf */
+ BUG_ON(path[depth].p_hdr->eh_entries != path[depth].p_hdr->eh_max);
+ /* start copy from next extent */
+ /* TODO: we could do it by single memmove */
+ m = 0;
+ path[depth].p_ext++;
+ while (path[depth].p_ext <=
+ EXT_MAX_EXTENT(path[depth].p_hdr)) {
+ ext_debug("move %d:%d:%d in new leaf %lu\n",
+ le32_to_cpu(path[depth].p_ext->ee_block),
+ le32_to_cpu(path[depth].p_ext->ee_start),
+ le16_to_cpu(path[depth].p_ext->ee_len),
+ newblock);
+ /*memmove(ex++, path[depth].p_ext++,
+ sizeof(struct ext3_extent));
+ neh->eh_entries++;*/
+ path[depth].p_ext++;
+ m++;
+ }
+ if (m) {
+ memmove(ex, path[depth].p_ext-m, sizeof(struct ext3_extent)*m);
+ neh->eh_entries = cpu_to_le16(le16_to_cpu(neh->eh_entries)+m);
+ }
+
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+
+ if ((err = ext3_journal_dirty_metadata(handle, bh)))
+ goto cleanup;
+ brelse(bh);
+ bh = NULL;
+
+ /* correct old leaf */
+ if (m) {
+ if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+ goto cleanup;
+ path[depth].p_hdr->eh_entries =
+ cpu_to_le16(le16_to_cpu(path[depth].p_hdr->eh_entries)-m);
+ if ((err = ext3_ext_dirty(handle, inode, path + depth)))
+ goto cleanup;
+
+ }
+
+ /* create intermediate indexes */
+ k = depth - at - 1;
+ BUG_ON(k < 0);
+ if (k)
+ ext_debug("create %d intermediate indices\n", k);
+ /* insert new index into current index block */
+ /* current depth stored in i var */
+ i = depth - 1;
+ while (k--) {
+ oldblock = newblock;
+ newblock = ablocks[--a];
+ bh = sb_getblk(inode->i_sb, newblock);
+ if (!bh) {
+ err = -EIO;
+ goto cleanup;
+ }
+ lock_buffer(bh);
+
+ if ((err = ext3_journal_get_create_access(handle, bh)))
+ goto cleanup;
+
+ neh = ext_block_hdr(bh);
+ neh->eh_entries = cpu_to_le16(1);
+ neh->eh_magic = EXT3_EXT_MAGIC;
+ neh->eh_max = cpu_to_le16(ext3_ext_space_block_idx(inode));
+ neh->eh_depth = cpu_to_le16(depth - i);
+ fidx = EXT_FIRST_INDEX(neh);
+ fidx->ei_block = border;
+ fidx->ei_leaf = cpu_to_le32(oldblock);
+
+ ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
+ newblock, (unsigned long) le32_to_cpu(border),
+ oldblock);
+ /* copy indexes */
+ m = 0;
+ path[i].p_idx++;
+
+ ext_debug("cur 0x%p, last 0x%p\n", path[i].p_idx,
+ EXT_MAX_INDEX(path[i].p_hdr));
+ BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
+ EXT_LAST_INDEX(path[i].p_hdr));
+ while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
+ ext_debug("%d: move %d:%d in new index %lu\n", i,
+ le32_to_cpu(path[i].p_idx->ei_block),
+ le32_to_cpu(path[i].p_idx->ei_leaf),
+ newblock);
+ /*memmove(++fidx, path[i].p_idx++,
+ sizeof(struct ext3_extent_idx));
+ neh->eh_entries++;
+ BUG_ON(neh->eh_entries > neh->eh_max);*/
+ path[i].p_idx++;
+ m++;
+ }
+ if (m) {
+ memmove(++fidx, path[i].p_idx - m,
+ sizeof(struct ext3_extent_idx) * m);
+ neh->eh_entries =
+ cpu_to_le16(le16_to_cpu(neh->eh_entries) + m);
+ }
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+
+ if ((err = ext3_journal_dirty_metadata(handle, bh)))
+ goto cleanup;
+ brelse(bh);
+ bh = NULL;
+
+ /* correct old index */
+ if (m) {
+ err = ext3_ext_get_access(handle, inode, path + i);
+ if (err)
+ goto cleanup;
+ path[i].p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path[i].p_hdr->eh_entries)-m);
+ err = ext3_ext_dirty(handle, inode, path + i);
+ if (err)
+ goto cleanup;
+ }
+
+ i--;
+ }
+
+ /* insert new index */
+ if (err)
+ goto cleanup;
+
+ err = ext3_ext_insert_index(handle, inode, path + at,
+ le32_to_cpu(border), newblock);
+
+cleanup:
+ if (bh) {
+ if (buffer_locked(bh))
+ unlock_buffer(bh);
+ brelse(bh);
+ }
+
+ if (err) {
+ /* free all allocated blocks in error case */
+ for (i = 0; i < depth; i++) {
+ if (!ablocks[i])
+ continue;
+ ext3_free_blocks(handle, inode, ablocks[i], 1);
+ }
+ }
+ kfree(ablocks);
+
+ return err;
+}
+
+/*
+ * routine implements tree growing procedure:
+ * - allocates new block
+ * - moves top-level data (index block or leaf) into the new block
+ * - initialize new top-level, creating index that points to the
+ * just created block
+ */
+static int ext3_ext_grow_indepth(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path,
+ struct ext3_extent *newext)
+{
+ struct ext3_ext_path *curp = path;
+ struct ext3_extent_header *neh;
+ struct ext3_extent_idx *fidx;
+ struct buffer_head *bh;
+ unsigned long newblock;
+ int err = 0;
+
+ newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
+ if (newblock == 0)
+ return err;
+
+ bh = sb_getblk(inode->i_sb, newblock);
+ if (!bh) {
+ err = -EIO;
+ ext3_std_error(inode->i_sb, err);
+ return err;
+ }
+ lock_buffer(bh);
+
+ if ((err = ext3_journal_get_create_access(handle, bh))) {
+ unlock_buffer(bh);
+ goto out;
+ }
+
+ /* move top-level index/leaf into new block */
+ memmove(bh->b_data, curp->p_hdr, sizeof(EXT3_I(inode)->i_data));
+
+ /* set size of new block */
+ neh = ext_block_hdr(bh);
+ /* old root could have indexes or leaves
+ * so calculate e_max right way */
+ if (ext_depth(inode))
+ neh->eh_max = cpu_to_le16(ext3_ext_space_block_idx(inode));
+ else
+ neh->eh_max = cpu_to_le16(ext3_ext_space_block(inode));
+ neh->eh_magic = EXT3_EXT_MAGIC;
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+
+ if ((err = ext3_journal_dirty_metadata(handle, bh)))
+ goto out;
+
+ /* create index in new top-level index: num,max,pointer */
+ if ((err = ext3_ext_get_access(handle, inode, curp)))
+ goto out;
+
+ curp->p_hdr->eh_magic = EXT3_EXT_MAGIC;
+ curp->p_hdr->eh_max = cpu_to_le16(ext3_ext_space_root_idx(inode));
+ curp->p_hdr->eh_entries = cpu_to_le16(1);
+ curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
+ /* FIXME: it works, but actually path[0] can be index */
+ curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
+ curp->p_idx->ei_leaf = cpu_to_le32(newblock);
+
+ neh = ext_inode_hdr(inode);
+ fidx = EXT_FIRST_INDEX(neh);
+ ext_debug("new root: num %d(%d), lblock %d, ptr %d\n",
+ le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
+ le32_to_cpu(fidx->ei_block), le32_to_cpu(fidx->ei_leaf));
+
+ neh->eh_depth = cpu_to_le16(path->p_depth + 1);
+ err = ext3_ext_dirty(handle, inode, curp);
+out:
+ brelse(bh);
+
+ return err;
+}
+
+/*
+ * routine finds empty index and adds new leaf. if no free index found
+ * then it requests in-depth growing
+ */
+static int ext3_ext_create_new_leaf(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path,
+ struct ext3_extent *newext)
+{
+ struct ext3_ext_path *curp;
+ int depth, i, err = 0;
+
+repeat:
+ i = depth = ext_depth(inode);
+
+ /* walk up to the tree and look for free index entry */
+ curp = path + depth;
+ while (i > 0 && !EXT_HAS_FREE_INDEX(curp)) {
+ i--;
+ curp--;
+ }
+
+ /* we use already allocated block for index block
+ * so, subsequent data blocks should be contigoues */
+ if (EXT_HAS_FREE_INDEX(curp)) {
+ /* if we found index with free entry, then use that
+ * entry: create all needed subtree and add new leaf */
+ err = ext3_ext_split(handle, inode, path, newext, i);
+
+ /* refill path */
+ ext3_ext_drop_refs(path);
+ path = ext3_ext_find_extent(inode,
+ le32_to_cpu(newext->ee_block),
+ path);
+ if (IS_ERR(path))
+ err = PTR_ERR(path);
+ } else {
+ /* tree is full, time to grow in depth */
+ err = ext3_ext_grow_indepth(handle, inode, path, newext);
+ if (err)
+ goto out;
+
+ /* refill path */
+ ext3_ext_drop_refs(path);
+ path = ext3_ext_find_extent(inode,
+ le32_to_cpu(newext->ee_block),
+ path);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ goto out;
+ }
+
+ /*
+ * only first (depth 0 -> 1) produces free space
+ * in all other cases we have to split growed tree
+ */
+ depth = ext_depth(inode);
+ if (path[depth].p_hdr->eh_entries == path[depth].p_hdr->eh_max) {
+ /* now we need split */
+ goto repeat;
+ }
+ }
+
+out:
+ return err;
+}
+
+/*
+ * returns allocated block in subsequent extent or EXT_MAX_BLOCK
+ * NOTE: it consider block number from index entry as
+ * allocated block. thus, index entries have to be consistent
+ * with leafs
+ */
+static unsigned long
+ext3_ext_next_allocated_block(struct ext3_ext_path *path)
+{
+ int depth;
+
+ BUG_ON(path == NULL);
+ depth = path->p_depth;
+
+ if (depth == 0 && path->p_ext == NULL)
+ return EXT_MAX_BLOCK;
+
+ while (depth >= 0) {
+ if (depth == path->p_depth) {
+ /* leaf */
+ if (path[depth].p_ext !=
+ EXT_LAST_EXTENT(path[depth].p_hdr))
+ return le32_to_cpu(path[depth].p_ext[1].ee_block);
+ } else {
+ /* index */
+ if (path[depth].p_idx !=
+ EXT_LAST_INDEX(path[depth].p_hdr))
+ return le32_to_cpu(path[depth].p_idx[1].ei_block);
+ }
+ depth--;
+ }
+
+ return EXT_MAX_BLOCK;
+}
+
+/*
+ * returns first allocated block from next leaf or EXT_MAX_BLOCK
+ */
+static unsigned ext3_ext_next_leaf_block(struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ int depth;
+
+ BUG_ON(path == NULL);
+ depth = path->p_depth;
+
+ /* zero-tree has no leaf blocks at all */
+ if (depth == 0)
+ return EXT_MAX_BLOCK;
+
+ /* go to index block */
+ depth--;
+
+ while (depth >= 0) {
+ if (path[depth].p_idx !=
+ EXT_LAST_INDEX(path[depth].p_hdr))
+ return le32_to_cpu(path[depth].p_idx[1].ei_block);
+ depth--;
+ }
+
+ return EXT_MAX_BLOCK;
+}
+
+/*
+ * if leaf gets modified and modified extent is first in the leaf
+ * then we have to correct all indexes above
+ * TODO: do we need to correct tree in all cases?
+ */
+int ext3_ext_correct_indexes(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ struct ext3_extent_header *eh;
+ int depth = ext_depth(inode);
+ struct ext3_extent *ex;
+ __le32 border;
+ int k, err = 0;
+
+ eh = path[depth].p_hdr;
+ ex = path[depth].p_ext;
+ BUG_ON(ex == NULL);
+ BUG_ON(eh == NULL);
+
+ if (depth == 0) {
+ /* there is no tree at all */
+ return 0;
+ }
+
+ if (ex != EXT_FIRST_EXTENT(eh)) {
+ /* we correct the tree only if the leaf's first extent got modified */
+ return 0;
+ }
+
+ /*
+ * TODO: we need correction if border is smaller than current one
+ */
+ k = depth - 1;
+ border = path[depth].p_ext->ee_block;
+ if ((err = ext3_ext_get_access(handle, inode, path + k)))
+ return err;
+ path[k].p_idx->ei_block = border;
+ if ((err = ext3_ext_dirty(handle, inode, path + k)))
+ return err;
+
+ while (k--) {
+ /* change all left-side indexes */
+ if (path[k+1].p_idx != EXT_FIRST_INDEX(path[k+1].p_hdr))
+ break;
+ if ((err = ext3_ext_get_access(handle, inode, path + k)))
+ break;
+ path[k].p_idx->ei_block = border;
+ if ((err = ext3_ext_dirty(handle, inode, path + k)))
+ break;
+ }
+
+ return err;
+}
+
+static int inline
+ext3_can_extents_be_merged(struct inode *inode, struct ext3_extent *ex1,
+ struct ext3_extent *ex2)
+{
+ /* FIXME: 48bit support */
+ if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len)
+ != le32_to_cpu(ex2->ee_block))
+ return 0;
+
+#ifdef AGRESSIVE_TEST
+ if (le16_to_cpu(ex1->ee_len) >= 4)
+ return 0;
+#endif
+
+ if (le32_to_cpu(ex1->ee_start) + le16_to_cpu(ex1->ee_len)
+ == le32_to_cpu(ex2->ee_start))
+ return 1;
+ return 0;
+}
+
+/*
+ * this routine tries to merge the requested extent into the existing
+ * extent, or inserts the requested extent as a new one into the tree,
+ * creating a new leaf in the no-space case
+ */
+int ext3_ext_insert_extent(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path,
+ struct ext3_extent *newext)
+{
+ struct ext3_extent_header * eh;
+ struct ext3_extent *ex, *fex;
+ struct ext3_extent *nearex; /* nearest extent */
+ struct ext3_ext_path *npath = NULL;
+ int depth, len, err, next;
+
+ BUG_ON(newext->ee_len == 0);
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ BUG_ON(path[depth].p_hdr == NULL);
+
+ /* try to insert block into found extent and return */
+ if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
+ ext_debug("append %d block to %d:%d (from %d)\n",
+ le16_to_cpu(newext->ee_len),
+ le32_to_cpu(ex->ee_block),
+ le16_to_cpu(ex->ee_len),
+ le32_to_cpu(ex->ee_start));
+ if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+ return err;
+ ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
+ + le16_to_cpu(newext->ee_len));
+ eh = path[depth].p_hdr;
+ nearex = ex;
+ goto merge;
+ }
+
+repeat:
+ depth = ext_depth(inode);
+ eh = path[depth].p_hdr;
+ if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max))
+ goto has_space;
+
+ /* probably next leaf has space for us? */
+ fex = EXT_LAST_EXTENT(eh);
+ next = ext3_ext_next_leaf_block(inode, path);
+ if (le32_to_cpu(newext->ee_block) > le32_to_cpu(fex->ee_block)
+ && next != EXT_MAX_BLOCK) {
+ ext_debug("next leaf block - %d\n", next);
+ BUG_ON(npath != NULL);
+ npath = ext3_ext_find_extent(inode, next, NULL);
+ if (IS_ERR(npath))
+ return PTR_ERR(npath);
+ BUG_ON(npath->p_depth != path->p_depth);
+ eh = npath[depth].p_hdr;
+ if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max)) {
+ ext_debug("next leaf isnt full(%d)\n",
+ le16_to_cpu(eh->eh_entries));
+ path = npath;
+ goto repeat;
+ }
+ ext_debug("next leaf has no free space(%d,%d)\n",
+ le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
+ }
+
+ /*
+ * there is no free space in found leaf
+ * we're gonna add new leaf in the tree
+ */
+ err = ext3_ext_create_new_leaf(handle, inode, path, newext);
+ if (err)
+ goto cleanup;
+ depth = ext_depth(inode);
+ eh = path[depth].p_hdr;
+
+has_space:
+ nearex = path[depth].p_ext;
+
+ if ((err = ext3_ext_get_access(handle, inode, path + depth)))
+ goto cleanup;
+
+ if (!nearex) {
+ /* there is no extent in this leaf, create first one */
+ ext_debug("first extent in the leaf: %d:%d:%d\n",
+ le32_to_cpu(newext->ee_block),
+ le32_to_cpu(newext->ee_start),
+ le16_to_cpu(newext->ee_len));
+ path[depth].p_ext = EXT_FIRST_EXTENT(eh);
+ } else if (le32_to_cpu(newext->ee_block)
+ > le32_to_cpu(nearex->ee_block)) {
+/* BUG_ON(newext->ee_block == nearex->ee_block); */
+ if (nearex != EXT_LAST_EXTENT(eh)) {
+ len = EXT_MAX_EXTENT(eh) - nearex;
+ len = (len - 1) * sizeof(struct ext3_extent);
+ len = len < 0 ? 0 : len;
+ ext_debug("insert %d:%d:%d after: nearest 0x%p, "
+ "move %d from 0x%p to 0x%p\n",
+ le32_to_cpu(newext->ee_block),
+ le32_to_cpu(newext->ee_start),
+ le16_to_cpu(newext->ee_len),
+ nearex, len, nearex + 1, nearex + 2);
+ memmove(nearex + 2, nearex + 1, len);
+ }
+ path[depth].p_ext = nearex + 1;
+ } else {
+ BUG_ON(newext->ee_block == nearex->ee_block);
+ len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
+ len = len < 0 ? 0 : len;
+ ext_debug("insert %d:%d:%d before: nearest 0x%p, "
+ "move %d from 0x%p to 0x%p\n",
+ le32_to_cpu(newext->ee_block),
+ le32_to_cpu(newext->ee_start),
+ le16_to_cpu(newext->ee_len),
+ nearex, len, nearex, nearex + 1);
+ memmove(nearex + 1, nearex, len);
+ path[depth].p_ext = nearex;
+ }
+
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)+1);
+ nearex = path[depth].p_ext;
+ nearex->ee_block = newext->ee_block;
+ nearex->ee_start = newext->ee_start;
+ nearex->ee_len = newext->ee_len;
+ /* FIXME: support for large fs */
+ nearex->ee_start_hi = 0;
+
+merge:
+ /* try to merge extents to the right */
+ while (nearex < EXT_LAST_EXTENT(eh)) {
+ if (!ext3_can_extents_be_merged(inode, nearex, nearex + 1))
+ break;
+ /* merge with next extent! */
+ nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
+ + le16_to_cpu(nearex[1].ee_len));
+ if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
+ len = (EXT_LAST_EXTENT(eh) - nearex - 1)
+ * sizeof(struct ext3_extent);
+ memmove(nearex + 1, nearex + 2, len);
+ }
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+ BUG_ON(eh->eh_entries == 0);
+ }
+
+ /* try to merge extents to the left */
+
+ /* time to correct all indexes above */
+ err = ext3_ext_correct_indexes(handle, inode, path);
+ if (err)
+ goto cleanup;
+
+ err = ext3_ext_dirty(handle, inode, path + depth);
+
+cleanup:
+ if (npath) {
+ ext3_ext_drop_refs(npath);
+ kfree(npath);
+ }
+ ext3_ext_tree_changed(inode);
+ ext3_ext_invalidate_cache(inode);
+ return err;
+}
+
+int ext3_ext_walk_space(struct inode *inode, unsigned long block,
+ unsigned long num, ext_prepare_callback func,
+ void *cbdata)
+{
+ struct ext3_ext_path *path = NULL;
+ struct ext3_ext_cache cbex;
+ struct ext3_extent *ex;
+ unsigned long next, start = 0, end = 0;
+ unsigned long last = block + num;
+ int depth, exists, err = 0;
+
+ BUG_ON(func == NULL);
+ BUG_ON(inode == NULL);
+
+ while (block < last && block != EXT_MAX_BLOCK) {
+ num = last - block;
+ /* find extent for this block */
+ path = ext3_ext_find_extent(inode, block, path);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ break;
+ }
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ ex = path[depth].p_ext;
+ next = ext3_ext_next_allocated_block(path);
+
+ exists = 0;
+ if (!ex) {
+ /* there is no extent yet, so try to allocate
+ * all requested space */
+ start = block;
+ end = block + num;
+ } else if (le32_to_cpu(ex->ee_block) > block) {
+ /* need to allocate space before found extent */
+ start = block;
+ end = le32_to_cpu(ex->ee_block);
+ if (block + num < end)
+ end = block + num;
+ } else if (block >=
+ le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+ /* need to allocate space after found extent */
+ start = block;
+ end = block + num;
+ if (end >= next)
+ end = next;
+ } else if (block >= le32_to_cpu(ex->ee_block)) {
+ /*
+ * some part of requested space is covered
+ * by found extent
+ */
+ start = block;
+ end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+ if (block + num < end)
+ end = block + num;
+ exists = 1;
+ } else {
+ BUG();
+ }
+ BUG_ON(end <= start);
+
+ if (!exists) {
+ cbex.ec_block = start;
+ cbex.ec_len = end - start;
+ cbex.ec_start = 0;
+ cbex.ec_type = EXT3_EXT_CACHE_GAP;
+ } else {
+ cbex.ec_block = le32_to_cpu(ex->ee_block);
+ cbex.ec_len = le16_to_cpu(ex->ee_len);
+ cbex.ec_start = le32_to_cpu(ex->ee_start);
+ cbex.ec_type = EXT3_EXT_CACHE_EXTENT;
+ }
+
+ BUG_ON(cbex.ec_len == 0);
+ err = func(inode, path, &cbex, cbdata);
+ ext3_ext_drop_refs(path);
+
+ if (err < 0)
+ break;
+ if (err == EXT_REPEAT)
+ continue;
+ else if (err == EXT_BREAK) {
+ err = 0;
+ break;
+ }
+
+ if (ext_depth(inode) != depth) {
+ /* depth was changed. we have to realloc path */
+ kfree(path);
+ path = NULL;
+ }
+
+ block = cbex.ec_block + cbex.ec_len;
+ }
+
+ if (path) {
+ ext3_ext_drop_refs(path);
+ kfree(path);
+ }
+
+ return err;
+}
+
+static inline void
+ext3_ext_put_in_cache(struct inode *inode, __u32 block,
+ __u32 len, __u32 start, int type)
+{
+ struct ext3_ext_cache *cex;
+ BUG_ON(len == 0);
+ cex = &EXT3_I(inode)->i_cached_extent;
+ cex->ec_type = type;
+ cex->ec_block = block;
+ cex->ec_len = len;
+ cex->ec_start = start;
+}
+
+/*
+ * this routine calculates the boundaries of the gap the requested block
+ * fits into and caches this gap
+ */
+static inline void
+ext3_ext_put_gap_in_cache(struct inode *inode, struct ext3_ext_path *path,
+ unsigned long block)
+{
+ int depth = ext_depth(inode);
+ unsigned long lblock, len;
+ struct ext3_extent *ex;
+
+ ex = path[depth].p_ext;
+ if (ex == NULL) {
+ /* there is no extent yet, so gap is [0;-] */
+ lblock = 0;
+ len = EXT_MAX_BLOCK;
+ ext_debug("cache gap(whole file):");
+ } else if (block < le32_to_cpu(ex->ee_block)) {
+ lblock = block;
+ len = le32_to_cpu(ex->ee_block) - block;
+ ext_debug("cache gap(before): %lu [%lu:%lu]",
+ (unsigned long) block,
+ (unsigned long) le32_to_cpu(ex->ee_block),
+ (unsigned long) le16_to_cpu(ex->ee_len));
+ } else if (block >= le32_to_cpu(ex->ee_block)
+ + le16_to_cpu(ex->ee_len)) {
+ lblock = le32_to_cpu(ex->ee_block)
+ + le16_to_cpu(ex->ee_len);
+ len = ext3_ext_next_allocated_block(path);
+ ext_debug("cache gap(after): [%lu:%lu] %lu",
+ (unsigned long) le32_to_cpu(ex->ee_block),
+ (unsigned long) le16_to_cpu(ex->ee_len),
+ (unsigned long) block);
+ BUG_ON(len == lblock);
+ len = len - lblock;
+ } else {
+ lblock = len = 0;
+ BUG();
+ }
+
+ ext_debug(" -> %lu:%lu\n", (unsigned long) lblock, len);
+ ext3_ext_put_in_cache(inode, lblock, len, 0, EXT3_EXT_CACHE_GAP);
+}
+
+static inline int
+ext3_ext_in_cache(struct inode *inode, unsigned long block,
+ struct ext3_extent *ex)
+{
+ struct ext3_ext_cache *cex;
+
+ cex = &EXT3_I(inode)->i_cached_extent;
+
+ /* has cache valid data? */
+ if (cex->ec_type == EXT3_EXT_CACHE_NO)
+ return EXT3_EXT_CACHE_NO;
+
+ BUG_ON(cex->ec_type != EXT3_EXT_CACHE_GAP &&
+ cex->ec_type != EXT3_EXT_CACHE_EXTENT);
+ if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) {
+ ex->ee_block = cpu_to_le32(cex->ec_block);
+ ex->ee_start = cpu_to_le32(cex->ec_start);
+ ex->ee_len = cpu_to_le16(cex->ec_len);
+ ext_debug("%lu cached by %lu:%lu:%lu\n",
+ (unsigned long) block,
+ (unsigned long) cex->ec_block,
+ (unsigned long) cex->ec_len,
+ (unsigned long) cex->ec_start);
+ return cex->ec_type;
+ }
+
+ /* not in cache */
+ return EXT3_EXT_CACHE_NO;
+}
+
+/*
+ * routine removes index from the index block
+ * it's used in truncate case only. thus all requests are for
+ * last index in the block only
+ */
+int ext3_ext_rm_idx(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ struct buffer_head *bh;
+ int err;
+ unsigned long leaf;
+
+ /* free index block */
+ path--;
+ leaf = le32_to_cpu(path->p_idx->ei_leaf);
+ BUG_ON(path->p_hdr->eh_entries == 0);
+ if ((err = ext3_ext_get_access(handle, inode, path)))
+ return err;
+ path->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path->p_hdr->eh_entries)-1);
+ if ((err = ext3_ext_dirty(handle, inode, path)))
+ return err;
+ ext_debug("index is empty, remove it, free block %lu\n", leaf);
+ bh = sb_find_get_block(inode->i_sb, leaf);
+ ext3_forget(handle, 1, inode, bh, leaf);
+ ext3_free_blocks(handle, inode, leaf, 1);
+ return err;
+}
+
+/*
+ * This routine returns the max. credits the extent tree can consume.
+ * It should be OK for low-performance paths like ->writepage()
+ * To allow many writing processes to fit into a single transaction,
+ * the caller should calculate credits under truncate_mutex and
+ * pass the actual path.
+ */
+int inline ext3_ext_calc_credits_for_insert(struct inode *inode,
+ struct ext3_ext_path *path)
+{
+ int depth, needed;
+
+ if (path) {
+ /* probably there is space in leaf? */
+ depth = ext_depth(inode);
+ if (le16_to_cpu(path[depth].p_hdr->eh_entries)
+ < le16_to_cpu(path[depth].p_hdr->eh_max))
+ return 1;
+ }
+
+ /*
+ * given 32bit logical block (4294967296 blocks), max. tree
+ * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
+ * let's also add one more level for imbalance.
+ */
+ depth = 5;
+
+ /* allocation of new data block(s) */
+ needed = 2;
+
+ /*
+ * tree can be full, so it'd need to grow in depth:
+ * allocation + old root + new root
+ */
+ needed += 2 + 1 + 1;
+
+ /*
+ * Index split can happen, we'd need:
+ * allocate intermediate indexes (bitmap + group)
+ * + change two blocks at each level, but root (already included)
+ */
+ needed += (depth * 2) + (depth * 2);
+
+ /* any allocation modifies superblock */
+ needed += 1;
+
+ return needed;
+}
+
+static int ext3_remove_blocks(handle_t *handle, struct inode *inode,
+ struct ext3_extent *ex,
+ unsigned long from, unsigned long to)
+{
+ struct buffer_head *bh;
+ int i;
+
+#ifdef EXTENTS_STATS
+ {
+ struct ext3_sb_info *sbi = EXT3_SB(inode->i_sb);
+ unsigned short ee_len = le16_to_cpu(ex->ee_len);
+ spin_lock(&sbi->s_ext_stats_lock);
+ sbi->s_ext_blocks += ee_len;
+ sbi->s_ext_extents++;
+ if (ee_len < sbi->s_ext_min)
+ sbi->s_ext_min = ee_len;
+ if (ee_len > sbi->s_ext_max)
+ sbi->s_ext_max = ee_len;
+ if (ext_depth(inode) > sbi->s_depth_max)
+ sbi->s_depth_max = ext_depth(inode);
+ spin_unlock(&sbi->s_ext_stats_lock);
+ }
+#endif
+ if (from >= le32_to_cpu(ex->ee_block)
+ && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ /* tail removal */
+ unsigned long num, start;
+ num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
+ start = le32_to_cpu(ex->ee_start) + le16_to_cpu(ex->ee_len) - num;
+ ext_debug("free last %lu blocks starting %lu\n", num, start);
+ for (i = 0; i < num; i++) {
+ bh = sb_find_get_block(inode->i_sb, start + i);
+ ext3_forget(handle, 0, inode, bh, start + i);
+ }
+ ext3_free_blocks(handle, inode, start, num);
+ } else if (from == le32_to_cpu(ex->ee_block)
+ && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+ printk("strange request: removal %lu-%lu from %u:%u\n",
+ from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ } else {
+ printk("strange request: removal(2) %lu-%lu from %u:%u\n",
+ from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+ }
+ return 0;
+}
+
+static int
+ext3_ext_rm_leaf(handle_t *handle, struct inode *inode,
+ struct ext3_ext_path *path, unsigned long start)
+{
+ int err = 0, correct_index = 0;
+ int depth = ext_depth(inode), credits;
+ struct ext3_extent_header *eh;
+ unsigned a, b, block, num;
+ unsigned long ex_ee_block;
+ unsigned short ex_ee_len;
+ struct ext3_extent *ex;
+
+ ext_debug("truncate since %lu in leaf\n", start);
+ if (!path[depth].p_hdr)
+ path[depth].p_hdr = ext_block_hdr(path[depth].p_bh);
+ eh = path[depth].p_hdr;
+ BUG_ON(eh == NULL);
+ BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
+ BUG_ON(eh->eh_magic != EXT3_EXT_MAGIC);
+
+ /* find where to start removing */
+ ex = EXT_LAST_EXTENT(eh);
+
+ ex_ee_block = le32_to_cpu(ex->ee_block);
+ ex_ee_len = le16_to_cpu(ex->ee_len);
+
+ while (ex >= EXT_FIRST_EXTENT(eh) &&
+ ex_ee_block + ex_ee_len > start) {
+ ext_debug("remove ext %lu:%u\n", ex_ee_block, ex_ee_len);
+ path[depth].p_ext = ex;
+
+ a = ex_ee_block > start ? ex_ee_block : start;
+ b = ex_ee_block + ex_ee_len - 1 < EXT_MAX_BLOCK ?
+ ex_ee_block + ex_ee_len - 1 : EXT_MAX_BLOCK;
+
+ ext_debug(" border %u:%u\n", a, b);
+
+ if (a != ex_ee_block && b != ex_ee_block + ex_ee_len - 1) {
+ block = 0;
+ num = 0;
+ BUG();
+ } else if (a != ex_ee_block) {
+ /* remove tail of the extent */
+ block = ex_ee_block;
+ num = a - block;
+ } else if (b != ex_ee_block + ex_ee_len - 1) {
+ /* remove head of the extent */
+ block = a;
+ num = b - a;
+ /* there is no "make a hole" API yet */
+ BUG();
+ } else {
+ /* remove whole extent: excellent! */
+ block = ex_ee_block;
+ num = 0;
+ BUG_ON(a != ex_ee_block);
+ BUG_ON(b != ex_ee_block + ex_ee_len - 1);
+ }
+
+ /* at present, extent can't cross block group */
+ /* leaf + bitmap + group desc + sb + inode */
+ credits = 5;
+ if (ex == EXT_FIRST_EXTENT(eh)) {
+ correct_index = 1;
+ credits += (ext_depth(inode)) + 1;
+ }
+#ifdef CONFIG_QUOTA
+ credits += 2 * EXT3_QUOTA_TRANS_BLOCKS(inode->i_sb);
+#endif
+
+ handle = ext3_ext_journal_restart(handle, credits);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto out;
+ }
+
+ err = ext3_ext_get_access(handle, inode, path + depth);
+ if (err)
+ goto out;
+
+ err = ext3_remove_blocks(handle, inode, ex, a, b);
+ if (err)
+ goto out;
+
+ if (num == 0) {
+ /* this extent is removed entirely; mark the slot unused */
+ ex->ee_start = 0;
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+ }
+
+ ex->ee_block = cpu_to_le32(block);
+ ex->ee_len = cpu_to_le16(num);
+
+ err = ext3_ext_dirty(handle, inode, path + depth);
+ if (err)
+ goto out;
+
+ ext_debug("new extent: %u:%u:%u\n", block, num,
+ le32_to_cpu(ex->ee_start));
+ ex--;
+ ex_ee_block = le32_to_cpu(ex->ee_block);
+ ex_ee_len = le16_to_cpu(ex->ee_len);
+ }
+
+ if (correct_index && eh->eh_entries)
+ err = ext3_ext_correct_indexes(handle, inode, path);
+
+ /* if this leaf is free, then we should
+ * remove it from index block above */
+ if (err == 0 && eh->eh_entries == 0 && path[depth].p_bh != NULL)
+ err = ext3_ext_rm_idx(handle, inode, path + depth);
+
+out:
+ return err;
+}
+
+/*
+ * returns 1 if the current index has to be freed (even partially)
+ */
+static int inline
+ext3_ext_more_to_rm(struct ext3_ext_path *path)
+{
+ BUG_ON(path->p_idx == NULL);
+
+ if (path->p_idx < EXT_FIRST_INDEX(path->p_hdr))
+ return 0;
+
+ /*
+ * if truncate on a deeper level happened, it wasn't partial,
+ * so we have to consider the current index for truncation
+ */
+ if (le16_to_cpu(path->p_hdr->eh_entries) == path->p_block)
+ return 0;
+ return 1;
+}
+
+int ext3_ext_remove_space(struct inode *inode, unsigned long start)
+{
+ struct super_block *sb = inode->i_sb;
+ int depth = ext_depth(inode);
+ struct ext3_ext_path *path;
+ handle_t *handle;
+ int i = 0, err = 0;
+
+ ext_debug("truncate since %lu\n", start);
+
+ /* probably first extent we're gonna free will be last in block */
+ handle = ext3_journal_start(inode, depth + 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ext3_ext_invalidate_cache(inode);
+
+ /*
+ * we start scanning from the right side, freeing all the blocks
+ * after i_size and walking down into the tree
+ */
+ path = kmalloc(sizeof(struct ext3_ext_path) * (depth + 1), GFP_KERNEL);
+ if (path == NULL) {
+ ext3_journal_stop(handle);
+ return -ENOMEM;
+ }
+ memset(path, 0, sizeof(struct ext3_ext_path) * (depth + 1));
+ path[0].p_hdr = ext_inode_hdr(inode);
+ if (ext3_ext_check_header(__FUNCTION__, inode, path[0].p_hdr)) {
+ err = -EIO;
+ goto out;
+ }
+ path[0].p_depth = depth;
+
+ while (i >= 0 && err == 0) {
+ if (i == depth) {
+ /* this is leaf block */
+ err = ext3_ext_rm_leaf(handle, inode, path, start);
+ /* root level has p_bh == NULL; brelse() tolerates this */
+ brelse(path[i].p_bh);
+ path[i].p_bh = NULL;
+ i--;
+ continue;
+ }
+
+ /* this is index block */
+ if (!path[i].p_hdr) {
+ ext_debug("initialize header\n");
+ path[i].p_hdr = ext_block_hdr(path[i].p_bh);
+ if (ext3_ext_check_header(__FUNCTION__, inode,
+ path[i].p_hdr)) {
+ err = -EIO;
+ goto out;
+ }
+ }
+
+ BUG_ON(le16_to_cpu(path[i].p_hdr->eh_entries)
+ > le16_to_cpu(path[i].p_hdr->eh_max));
+ BUG_ON(path[i].p_hdr->eh_magic != EXT3_EXT_MAGIC);
+
+ if (!path[i].p_idx) {
+ /* this level hasn't been touched yet */
+ path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
+ path[i].p_block = le16_to_cpu(path[i].p_hdr->eh_entries)+1;
+ ext_debug("init index ptr: hdr 0x%p, num %d\n",
+ path[i].p_hdr,
+ le16_to_cpu(path[i].p_hdr->eh_entries));
+ } else {
+ /* we've already been here, look at the next index */
+ path[i].p_idx--;
+ }
+
+ ext_debug("level %d - index, first 0x%p, cur 0x%p\n",
+ i, EXT_FIRST_INDEX(path[i].p_hdr),
+ path[i].p_idx);
+ if (ext3_ext_more_to_rm(path + i)) {
+ /* go to the next level */
+ ext_debug("move to level %d (block %d)\n",
+ i + 1, le32_to_cpu(path[i].p_idx->ei_leaf));
+ memset(path + i + 1, 0, sizeof(*path));
+ path[i+1].p_bh =
+ sb_bread(sb, le32_to_cpu(path[i].p_idx->ei_leaf));
+ if (!path[i+1].p_bh) {
+ /* should we reset i_size? */
+ err = -EIO;
+ break;
+ }
+
+ /* save the current number of indexes so we can tell
+ * whether it changed at the next iteration */
+ path[i].p_block = le16_to_cpu(path[i].p_hdr->eh_entries);
+ i++;
+ } else {
+ /* we finished processing this index, go up */
+ if (path[i].p_hdr->eh_entries == 0 && i > 0) {
+ /* index is empty, remove it
+ * handle must be already prepared by the
+ * truncatei_leaf() */
+ err = ext3_ext_rm_idx(handle, inode, path + i);
+ }
+ /* root level has p_bh == NULL; brelse() tolerates this */
+ brelse(path[i].p_bh);
+ path[i].p_bh = NULL;
+ i--;
+ ext_debug("return to level %d\n", i);
+ }
+ }
+
+ /* TODO: flexible tree reduction should be here */
+ if (path->p_hdr->eh_entries == 0) {
+ /*
+ * truncate to zero freed all the tree
+ * so, we need to correct eh_depth
+ */
+ err = ext3_ext_get_access(handle, inode, path);
+ if (err == 0) {
+ ext_inode_hdr(inode)->eh_depth = 0;
+ ext_inode_hdr(inode)->eh_max =
+ cpu_to_le16(ext3_ext_space_root(inode));
+ err = ext3_ext_dirty(handle, inode, path);
+ }
+ }
+out:
+ ext3_ext_tree_changed(inode);
+ ext3_ext_drop_refs(path);
+ kfree(path);
+ ext3_journal_stop(handle);
+
+ return err;
+}
+
+/*
+ * called at mount time
+ */
+void ext3_ext_init(struct super_block *sb)
+{
+ /*
+ * possible initialization would be here
+ */
+
+ if (test_opt(sb, EXTENTS)) {
+ printk("EXT3-fs: file extents enabled");
+#ifdef AGRESSIVE_TEST
+ printk(", agressive tests");
+#endif
+#ifdef CHECK_BINSEARCH
+ printk(", check binsearch");
+#endif
+#ifdef EXTENTS_STATS
+ printk(", stats");
+#endif
+ printk("\n");
+#ifdef EXTENTS_STATS
+ spin_lock_init(&EXT3_SB(sb)->s_ext_stats_lock);
+ EXT3_SB(sb)->s_ext_min = 1 << 30;
+ EXT3_SB(sb)->s_ext_max = 0;
+#endif
+ }
+}
+
+/*
+ * called at umount time
+ */
+void ext3_ext_release(struct super_block *sb)
+{
+ if (!test_opt(sb, EXTENTS))
+ return;
+
+#ifdef EXTENTS_STATS
+ if (EXT3_SB(sb)->s_ext_blocks && EXT3_SB(sb)->s_ext_extents) {
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ printk(KERN_ERR "EXT3-fs: %lu blocks in %lu extents (%lu ave)\n",
+ sbi->s_ext_blocks, sbi->s_ext_extents,
+ sbi->s_ext_blocks / sbi->s_ext_extents);
+ printk(KERN_ERR "EXT3-fs: extents: %lu min, %lu max, max depth %lu\n",
+ sbi->s_ext_min, sbi->s_ext_max, sbi->s_depth_max);
+ }
+#endif
+}
+
+int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, sector_t iblock,
+ unsigned long max_blocks, struct buffer_head *bh_result,
+ int create, int extend_disksize)
+{
+ struct ext3_ext_path *path = NULL;
+ struct ext3_extent newex, *ex;
+ int goal, newblock, err = 0, depth;
+ unsigned long allocated = 0;
+
+ __clear_bit(BH_New, &bh_result->b_state);
+ ext_debug("blocks %d/%lu requested for inode %u\n", (int) iblock,
+ max_blocks, (unsigned) inode->i_ino);
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+
+ /* check in cache */
+ if ((goal = ext3_ext_in_cache(inode, iblock, &newex))) {
+ if (goal == EXT3_EXT_CACHE_GAP) {
+ if (!create) {
+ /* block isn't allocated yet and
+ * the user doesn't want to allocate it */
+ goto out2;
+ }
+ /* we should allocate requested block */
+ } else if (goal == EXT3_EXT_CACHE_EXTENT) {
+ /* block is already allocated */
+ newblock = iblock
+ - le32_to_cpu(newex.ee_block)
+ + le32_to_cpu(newex.ee_start);
+ /* number of remaining blocks in the extent */
+ allocated = le16_to_cpu(newex.ee_len) -
+ (iblock - le32_to_cpu(newex.ee_block));
+ goto out;
+ } else {
+ BUG();
+ }
+ }
+
+ /* find extent for this block */
+ path = ext3_ext_find_extent(inode, iblock, NULL);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ goto out2;
+ }
+
+ depth = ext_depth(inode);
+
+ /*
+ * a consistent leaf must not be empty
+ * this situation is possible, though, _during_ tree modification
+ * this is why the assert can't be put in ext3_ext_find_extent()
+ */
+ BUG_ON(path[depth].p_ext == NULL && depth != 0);
+
+ if ((ex = path[depth].p_ext)) {
+ unsigned long ee_block = le32_to_cpu(ex->ee_block);
+ unsigned long ee_start = le32_to_cpu(ex->ee_start);
+ unsigned short ee_len = le16_to_cpu(ex->ee_len);
+ /* if the found extent covers the block, simply return it */
+ if (iblock >= ee_block && iblock < ee_block + ee_len) {
+ newblock = iblock - ee_block + ee_start;
+ /* number of remaining blocks in the extent */
+ allocated = ee_len - (iblock - ee_block);
+ ext_debug("%d fit into %lu:%d -> %d\n", (int) iblock,
+ ee_block, ee_len, newblock);
+ ext3_ext_put_in_cache(inode, ee_block, ee_len,
+ ee_start, EXT3_EXT_CACHE_EXTENT);
+ goto out;
+ }
+ }
+
+ /*
+ * the requested block isn't allocated yet
+ * we must not try to create the block if the create flag is zero
+ */
+ if (!create) {
+ /* put the just-found gap into cache to speed up subsequent requests */
+ ext3_ext_put_gap_in_cache(inode, path, iblock);
+ goto out2;
+ }
+
+ /* allocate new block */
+ goal = ext3_ext_find_goal(inode, path, iblock);
+ allocated = max_blocks;
+ newblock = ext3_new_blocks(handle, inode, goal, &allocated, &err);
+ if (!newblock)
+ goto out2;
+ ext_debug("allocate new block: goal %d, found %d/%lu\n",
+ goal, newblock, allocated);
+
+ /* try to insert new extent into found leaf and return */
+ newex.ee_block = cpu_to_le32(iblock);
+ newex.ee_start = cpu_to_le32(newblock);
+ newex.ee_len = cpu_to_le16(allocated);
+ err = ext3_ext_insert_extent(handle, inode, path, &newex);
+ if (err)
+ goto out2;
+
+ if (extend_disksize && inode->i_size > EXT3_I(inode)->i_disksize)
+ EXT3_I(inode)->i_disksize = inode->i_size;
+
+ /* previous routine could use block we allocated */
+ newblock = le32_to_cpu(newex.ee_start);
+ __set_bit(BH_New, &bh_result->b_state);
+
+ ext3_ext_put_in_cache(inode, iblock, allocated, newblock,
+ EXT3_EXT_CACHE_EXTENT);
+out:
+ if (allocated > max_blocks)
+ allocated = max_blocks;
+ ext3_ext_show_leaf(inode, path);
+ __set_bit(BH_Mapped, &bh_result->b_state);
+ bh_result->b_bdev = inode->i_sb->s_bdev;
+ bh_result->b_blocknr = newblock;
+out2:
+ if (path) {
+ ext3_ext_drop_refs(path);
+ kfree(path);
+ }
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+
+ return err ? err : allocated;
+}
+
+void ext3_ext_truncate(struct inode * inode, struct page *page)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct super_block *sb = inode->i_sb;
+ unsigned long last_block;
+ handle_t *handle;
+ int err = 0;
+
+ /*
+ * probably first extent we're gonna free will be last in block
+ */
+ err = ext3_writepage_trans_blocks(inode) + 3;
+ handle = ext3_journal_start(inode, err);
+ if (IS_ERR(handle)) {
+ if (page) {
+ clear_highpage(page);
+ flush_dcache_page(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ return;
+ }
+
+ if (page)
+ ext3_block_truncate_page(handle, page, mapping, inode->i_size);
+
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+ ext3_ext_invalidate_cache(inode);
+
+ /*
+ * TODO: optimization is possible here
+ * probably we need no scanning at all,
+ * because page truncation is enough
+ */
+ if (ext3_orphan_add(handle, inode))
+ goto out_stop;
+
+ /* we have to know where to truncate from in crash case */
+ EXT3_I(inode)->i_disksize = inode->i_size;
+ ext3_mark_inode_dirty(handle, inode);
+
+ last_block = (inode->i_size + sb->s_blocksize - 1)
+ >> EXT3_BLOCK_SIZE_BITS(sb);
+ err = ext3_ext_remove_space(inode, last_block);
+
+ /* In a multi-transaction truncate, we only make the final
+ * transaction synchronous */
+ if (IS_SYNC(inode))
+ handle->h_sync = 1;
+
+out_stop:
+ /*
+ * If this was a simple ftruncate(), and the file will remain alive
+ * then we need to clear up the orphan record which we created above.
+ * However, if this was a real unlink then we were called by
+ * ext3_delete_inode(), and we allow that function to clean up the
+ * orphan info for us.
+ */
+ if (inode->i_nlink)
+ ext3_orphan_del(handle, inode);
+
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+ ext3_journal_stop(handle);
+}
+
+/*
+ * this routine calculates the max number of blocks we could modify
+ * in order to allocate a new block for an inode
+ */
+int ext3_ext_writepage_trans_blocks(struct inode *inode, int num)
+{
+ int needed;
+
+ needed = ext3_ext_calc_credits_for_insert(inode, NULL);
+
+ /* caller wants to allocate num blocks, but note it includes sb */
+ needed = needed * num - (num - 1);
+
+#ifdef CONFIG_QUOTA
+ needed += 2 * EXT3_QUOTA_TRANS_BLOCKS(inode->i_sb);
+#endif
+
+ return needed;
+}
+
+EXPORT_SYMBOL(ext3_mark_inode_dirty);
+EXPORT_SYMBOL(ext3_ext_invalidate_cache);
+EXPORT_SYMBOL(ext3_ext_insert_extent);
+EXPORT_SYMBOL(ext3_ext_walk_space);
+EXPORT_SYMBOL(ext3_ext_find_goal);
+EXPORT_SYMBOL(ext3_ext_calc_credits_for_insert);
+
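The merge test in ext3_can_extents_be_merged() above can be sketched in user space as follows. This is a minimal illustration, not the patch's code: struct demo_extent and demo_can_merge() are hypothetical names, fields are kept in host byte order rather than little-endian, and the final check against a 16-bit length overflow is an assumption of the sketch that the patch itself does not make.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian mirror of the on-disk extent (illustration only). */
struct demo_extent {
	uint32_t block;	/* first logical block covered by the extent */
	uint32_t start;	/* first physical block */
	uint16_t len;	/* number of blocks covered */
};

/*
 * Two extents can be merged only if b begins exactly where a ends,
 * both logically and physically.
 */
static int demo_can_merge(const struct demo_extent *a,
			  const struct demo_extent *b)
{
	/* logical ranges must be adjacent */
	if ((uint64_t)a->block + a->len != b->block)
		return 0;
	/* physical ranges must be adjacent */
	if ((uint64_t)a->start + a->len != b->start)
		return 0;
	/* assumption of this sketch: the merged length must fit in 16 bits */
	if ((uint32_t)a->len + b->len > UINT16_MAX)
		return 0;
	return 1;
}
```

With a = {0, 100, 8}, an extent b = {8, 108, 4} is mergeable, while {8, 200, 4} is not: it is logically adjacent but physically elsewhere.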
diff -puN fs/ext3/ialloc.c~ext3-extents fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~ext3-extents 2006-06-28 13:25:19.631990895 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c 2006-06-28 13:39:24.990992777 -0700
@@ -616,6 +616,17 @@ got:
ext3_std_error(sb, err);
goto fail_free_drop;
}
+ if (test_opt(sb, EXTENTS)) {
+ EXT3_I(inode)->i_flags |= EXT3_EXTENTS_FL;
+ ext3_ext_tree_init(handle, inode);
+ if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS)) {
+ err = ext3_journal_get_write_access(handle, EXT3_SB(sb)->s_sbh);
+ if (err) goto fail;
+ EXT3_SET_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS);
+ BUFFER_TRACE(EXT3_SB(sb)->s_sbh, "call ext3_journal_dirty_metadata");
+ err = ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
+ }
+ }
ext3_debug("allocating inode %lu\n", inode->i_ino);
goto really_out;
diff -puN fs/ext3/inode.c~ext3-extents fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~ext3-extents 2006-06-28 13:25:19.648988944 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c 2006-06-28 13:39:05.103274732 -0700
@@ -39,8 +39,6 @@
#include "xattr.h"
#include "acl.h"
-static int ext3_writepage_trans_blocks(struct inode *inode);
-
/*
* Test whether an inode is a fast symlink.
*/
@@ -803,6 +801,7 @@ int ext3_get_blocks_handle(handle_t *han
ext3_fsblk_t first_block = 0;
+ J_ASSERT(!(EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL));
J_ASSERT(handle != NULL || create == 0);
depth = ext3_block_to_path(inode,iblock,offsets,&blocks_to_boundary);
@@ -983,7 +982,7 @@ static int ext3_get_block(struct inode *
get_block:
if (ret == 0) {
- ret = ext3_get_blocks_handle(handle, inode, iblock,
+ ret = ext3_get_blocks_wrap(handle, inode, iblock,
max_blocks, bh_result, create, 0);
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
@@ -1007,7 +1006,7 @@ struct buffer_head *ext3_getblk(handle_t
dummy.b_state = 0;
dummy.b_blocknr = -1000;
buffer_trace_init(&dummy.b_history);
- err = ext3_get_blocks_handle(handle, inode, block, 1,
+ err = ext3_get_blocks_wrap(handle, inode, block, 1,
&dummy, create, 1);
if (err == 1) {
err = 0;
@@ -1755,7 +1754,7 @@ void ext3_set_aops(struct inode *inode)
* This required during truncate. We need to physically zero the tail end
* of that block so it doesn't yield old data if the file is later grown.
*/
-static int ext3_block_truncate_page(handle_t *handle, struct page *page,
+int ext3_block_truncate_page(handle_t *handle, struct page *page,
struct address_space *mapping, loff_t from)
{
ext3_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -2259,6 +2258,9 @@ void ext3_truncate(struct inode *inode)
return;
}
+ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+ return ext3_ext_truncate(inode, page);
+
handle = start_transaction(inode);
if (IS_ERR(handle)) {
if (page) {
@@ -3001,12 +3003,15 @@ err_out:
* block and work out the exact number of indirects which are touched. Pah.
*/
-static int ext3_writepage_trans_blocks(struct inode *inode)
+int ext3_writepage_trans_blocks(struct inode *inode)
{
int bpp = ext3_journal_blocks_per_page(inode);
int indirects = (EXT3_NDIR_BLOCKS % bpp) ? 5 : 3;
int ret;
+ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+ return ext3_ext_writepage_trans_blocks(inode, bpp);
+
if (ext3_should_journal_data(inode))
ret = 3 * (bpp + indirects) + 2;
else
diff -puN fs/ext3/ioctl.c~ext3-extents fs/ext3/ioctl.c
--- linux-2.6.17/fs/ext3/ioctl.c~ext3-extents 2006-06-28 13:25:19.649988830 -0700
+++ linux-2.6.17-ming/fs/ext3/ioctl.c 2006-06-28 13:25:19.680985273 -0700
@@ -247,7 +247,6 @@ flags_err:
return err;
}
-
default:
return -ENOTTY;
}
diff -puN fs/ext3/Makefile~ext3-extents fs/ext3/Makefile
--- linux-2.6.17/fs/ext3/Makefile~ext3-extents 2006-06-28 13:25:19.651988600 -0700
+++ linux-2.6.17-ming/fs/ext3/Makefile 2006-06-28 13:25:19.681985158 -0700
@@ -5,7 +5,7 @@
obj-$(CONFIG_EXT3_FS) += ext3.o
ext3-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o hash.o resize.o
+ ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o
ext3-$(CONFIG_EXT3_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext3-$(CONFIG_EXT3_FS_POSIX_ACL) += acl.o
diff -puN fs/ext3/super.c~ext3-extents fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~ext3-extents 2006-06-28 13:25:19.652988486 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c 2006-06-28 13:39:24.996992088 -0700
@@ -390,6 +390,7 @@ static void ext3_put_super (struct super
struct ext3_super_block *es = sbi->s_es;
int i;
+ ext3_ext_release(sb);
ext3_xattr_put_super(sb);
journal_destroy(sbi->s_journal);
if (!(sb->s_flags & MS_RDONLY)) {
@@ -454,6 +455,7 @@ static struct inode *ext3_alloc_inode(st
#endif
ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
+ memset(&ei->i_cached_extent, 0, sizeof(struct ext3_ext_cache));
return &ei->vfs_inode;
}
@@ -636,7 +638,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota
+ Opt_grpquota, Opt_extents,
};
static match_table_t tokens = {
@@ -686,6 +688,7 @@ static match_table_t tokens = {
{Opt_quota, "quota"},
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
+ {Opt_extents, "extents"},
{Opt_err, NULL},
{Opt_resize, "resize"},
};
@@ -1018,6 +1021,9 @@ clear_qf_name:
case Opt_bh:
clear_opt(sbi->s_mount_opt, NOBH);
break;
+ case Opt_extents:
+ set_opt (sbi->s_mount_opt, EXTENTS);
+ break;
default:
printk (KERN_ERR
"EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1743,6 +1749,8 @@ static int ext3_fill_super (struct super
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
"writeback");
+ ext3_ext_init(sb);
+
lock_kernel();
return 0;
diff -puN /dev/null include/linux/ext3_fs_extents.h
--- /dev/null 2006-06-28 00:02:13.345547960 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h 2006-06-28 13:39:22.745250457 -0700
@@ -0,0 +1,196 @@
+/*
+ * Copyright (c) 2003-2006, Cluster File Systems, Inc, info@clusterfs.com
+ * Written by Alex Tomas <alex@clusterfs.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
+ */
+
+#ifndef _LINUX_EXT3_EXTENTS
+#define _LINUX_EXT3_EXTENTS
+
+#include <linux/ext3_fs.h>
+
+/*
+ * with AGRESSIVE_TEST defined, the capacity of index/leaf blocks
+ * becomes very small, so index splits, in-depth growing and
+ * other hard changes happen much more often.
+ * this is for debug purposes only
+ */
+#define AGRESSIVE_TEST_
+
+/*
+ * with EXTENTS_STATS defined, the number of blocks and extents
+ * is collected in the truncate path. they'll be shown at
+ * umount time
+ */
+#define EXTENTS_STATS__
+
+/*
+ * if CHECK_BINSEARCH defined, then results of binary search
+ * will be checked by linear search
+ */
+#define CHECK_BINSEARCH__
+
+/*
+ * if EXT_DEBUG is defined you can use 'extdebug' mount option
+ * to get lots of info what's going on
+ */
+#define EXT_DEBUG__
+#ifdef EXT_DEBUG
+#define ext_debug(a...) printk(a)
+#else
+#define ext_debug(a...)
+#endif
+
+/*
+ * if EXT_STATS is defined then stats numbers are collected
+ * these numbers will be displayed at umount time
+ */
+#define EXT_STATS_
+
+
+/*
+ * ext3_inode has an i_block array (60 bytes total)
+ * the first 12 bytes store the ext3_extent_header
+ * the remainder stores an array of ext3_extent
+ */
+
+/*
+ * this is extent on-disk structure
+ * it's used at the bottom of the tree
+ */
+struct ext3_extent {
+ __le32 ee_block; /* first logical block extent covers */
+ __le16 ee_len; /* number of blocks covered by extent */
+ __le16 ee_start_hi; /* high 16 bits of physical block */
+ __le32 ee_start; /* low 32 bits of physical block */
+};
+
+/*
+ * this is index on-disk structure
+ * it's used at all the levels, but the bottom
+ */
+struct ext3_extent_idx {
+ __le32 ei_block; /* index covers logical blocks from 'block' */
+ __le32 ei_leaf; /* pointer to the physical block of the next *
+ * level. leaf or next index could be here */
+ __le16 ei_leaf_hi; /* high 16 bits of physical block */
+ __u16 ei_unused;
+};
+
+/*
+ * each block (leaves and indexes), even inode-stored, has a header
+ */
+struct ext3_extent_header {
+ __le16 eh_magic; /* probably will support different formats */
+ __le16 eh_entries; /* number of valid entries */
+ __le16 eh_max; /* capacity of store in entries */
+ __le16 eh_depth; /* has tree real underlying blocks? */
+ __le32 eh_generation; /* generation of the tree */
+};
+
+#define EXT3_EXT_MAGIC cpu_to_le16(0xf30a)
+
+/*
+ * array of ext3_ext_path contains path to some extent
+ * creation/lookup routines use it for traversal/splitting/etc
+ * truncate uses it to simulate recursive walking
+ */
+struct ext3_ext_path {
+ __u32 p_block;
+ __u16 p_depth;
+ struct ext3_extent *p_ext;
+ struct ext3_extent_idx *p_idx;
+ struct ext3_extent_header *p_hdr;
+ struct buffer_head *p_bh;
+};
+
+/*
+ * structure for external API
+ */
+
+#define EXT3_EXT_CACHE_NO 0
+#define EXT3_EXT_CACHE_GAP 1
+#define EXT3_EXT_CACHE_EXTENT 2
+
+/*
+ * to be called by ext3_ext_walk_space()
+ * negative retcode - error
+ * positive retcode - signal for ext3_ext_walk_space(), see below
+ * callback must return valid extent (passed or newly created)
+ */
+typedef int (*ext_prepare_callback)(struct inode *, struct ext3_ext_path *,
+ struct ext3_ext_cache *,
+ void *);
+
+#define EXT_CONTINUE 0
+#define EXT_BREAK 1
+#define EXT_REPEAT 2
+
+
+#define EXT_MAX_BLOCK 0xffffffff
+
+
+#define EXT_FIRST_EXTENT(__hdr__) \
+ ((struct ext3_extent *) (((char *) (__hdr__)) + \
+ sizeof(struct ext3_extent_header)))
+#define EXT_FIRST_INDEX(__hdr__) \
+ ((struct ext3_extent_idx *) (((char *) (__hdr__)) + \
+ sizeof(struct ext3_extent_header)))
+#define EXT_HAS_FREE_INDEX(__path__) \
+ (le16_to_cpu((__path__)->p_hdr->eh_entries) \
+ < le16_to_cpu((__path__)->p_hdr->eh_max))
+#define EXT_LAST_EXTENT(__hdr__) \
+ (EXT_FIRST_EXTENT((__hdr__)) + le16_to_cpu((__hdr__)->eh_entries) - 1)
+#define EXT_LAST_INDEX(__hdr__) \
+ (EXT_FIRST_INDEX((__hdr__)) + le16_to_cpu((__hdr__)->eh_entries) - 1)
+#define EXT_MAX_EXTENT(__hdr__) \
+ (EXT_FIRST_EXTENT((__hdr__)) + le16_to_cpu((__hdr__)->eh_max) - 1)
+#define EXT_MAX_INDEX(__hdr__) \
+ (EXT_FIRST_INDEX((__hdr__)) + le16_to_cpu((__hdr__)->eh_max) - 1)
+
+static inline struct ext3_extent_header *ext_inode_hdr(struct inode *inode)
+{
+ return (struct ext3_extent_header *) EXT3_I(inode)->i_data;
+}
+
+static inline struct ext3_extent_header *ext_block_hdr(struct buffer_head *bh)
+{
+ return (struct ext3_extent_header *) bh->b_data;
+}
+
+static inline unsigned short ext_depth(struct inode *inode)
+{
+ return le16_to_cpu(ext_inode_hdr(inode)->eh_depth);
+}
+
+static inline void ext3_ext_tree_changed(struct inode *inode)
+{
+ EXT3_I(inode)->i_ext_generation++;
+}
+
+static inline void
+ext3_ext_invalidate_cache(struct inode *inode)
+{
+ EXT3_I(inode)->i_cached_extent.ec_type = EXT3_EXT_CACHE_NO;
+}
+
+extern int ext3_extent_tree_init(handle_t *, struct inode *);
+extern int ext3_ext_calc_credits_for_insert(struct inode *, struct ext3_ext_path *);
+extern int ext3_ext_insert_extent(handle_t *, struct inode *, struct ext3_ext_path *, struct ext3_extent *);
+extern int ext3_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
+extern struct ext3_ext_path * ext3_ext_find_extent(struct inode *, int, struct ext3_ext_path *);
+
+#endif /* _LINUX_EXT3_EXTENTS */
+
diff -puN include/linux/ext3_fs.h~ext3-extents include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3-extents 2006-06-28 13:25:19.654988256 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 13:39:24.998991859 -0700
@@ -182,8 +182,9 @@ struct ext3_group_desc
#define EXT3_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define EXT3_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
#define EXT3_RESERVED_FL 0x80000000 /* reserved for ext3 lib */
+#define EXT3_EXTENTS_FL 0x00080000 /* Inode uses extents */
-#define EXT3_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
+#define EXT3_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */
#define EXT3_FL_USER_MODIFIABLE 0x000380FF /* User modifiable flags */
/*
@@ -371,6 +372,7 @@ struct ext3_inode {
#define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */
#define EXT3_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
+#define EXT3_MOUNT_EXTENTS 0x400000 /* Extents support */
/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
@@ -560,11 +562,13 @@ static inline struct ext3_inode_info *EX
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004 /* Needs recovery */
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
#define EXT3_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */
#define EXT3_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT3_FEATURE_INCOMPAT_SUPP (EXT3_FEATURE_INCOMPAT_FILETYPE| \
EXT3_FEATURE_INCOMPAT_RECOVER| \
- EXT3_FEATURE_INCOMPAT_META_BG)
+ EXT3_FEATURE_INCOMPAT_META_BG| \
+ EXT3_FEATURE_INCOMPAT_EXTENTS)
#define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
@@ -803,6 +807,9 @@ extern int ext3_get_inode_loc(struct ino
extern void ext3_truncate (struct inode *);
extern void ext3_set_inode_flags(struct inode *);
extern void ext3_set_aops(struct inode *inode);
+extern int ext3_writepage_trans_blocks(struct inode *);
+extern int ext3_block_truncate_page(handle_t *handle, struct page *page,
+ struct address_space *mapping, loff_t from);
/* ioctl.c */
extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
@@ -856,6 +863,26 @@ extern struct inode_operations ext3_spec
extern struct inode_operations ext3_symlink_inode_operations;
extern struct inode_operations ext3_fast_symlink_inode_operations;
+/* extents.c */
+extern int ext3_ext_tree_init(handle_t *handle, struct inode *);
+extern int ext3_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext3_ext_get_blocks(handle_t *, struct inode *, sector_t,
+ unsigned long, struct buffer_head *, int, int);
+extern void ext3_ext_truncate(struct inode *, struct page *);
+extern void ext3_ext_init(struct super_block *);
+extern void ext3_ext_release(struct super_block *);
+static inline int
+ext3_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
+ unsigned long max_blocks, struct buffer_head *bh,
+ int create, int extend_disksize)
+{
+ if (EXT3_I(inode)->i_flags & EXT3_EXTENTS_FL)
+ return ext3_ext_get_blocks(handle, inode, block, max_blocks,
+ bh, create, extend_disksize);
+ return ext3_get_blocks_handle(handle, inode, block, max_blocks, bh,
+ create, extend_disksize);
+}
+
#endif /* __KERNEL__ */
diff -puN include/linux/ext3_fs_i.h~ext3-extents include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents 2006-06-28 13:25:19.670986420 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h 2006-06-28 13:39:24.999991744 -0700
@@ -65,6 +65,16 @@ struct ext3_block_alloc_info {
#define rsv_end rsv_window._rsv_end
/*
+ * storage for cached extent
+ */
+struct ext3_ext_cache {
+ __u32 ec_start;
+ __u32 ec_block;
+ __u32 ec_len; /* must be 32bit to return holes */
+ __u32 ec_type;
+};
+
+/*
* third extended file system inode data in memory
*/
struct ext3_inode_info {
@@ -142,6 +152,9 @@ struct ext3_inode_info {
*/
struct mutex truncate_mutex;
struct inode vfs_inode;
+
+ unsigned long i_ext_generation;
+ struct ext3_ext_cache i_cached_extent;
};
#endif /* _LINUX_EXT3_FS_I */
diff -puN include/linux/ext3_fs_sb.h~ext3-extents include/linux/ext3_fs_sb.h
--- linux-2.6.17/include/linux/ext3_fs_sb.h~ext3-extents 2006-06-28 13:25:19.672986191 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_sb.h 2006-06-28 13:25:19.686984585 -0700
@@ -78,6 +78,16 @@ struct ext3_sb_info {
char *s_qf_names[MAXQUOTAS]; /* Names of quota files with journalled quota */
int s_jquota_fmt; /* Format of quota to use */
#endif
+
+#ifdef EXTENTS_STATS
+ /* ext3 extents stats */
+ unsigned long s_ext_min;
+ unsigned long s_ext_max;
+ unsigned long s_depth_max;
+ spinlock_t s_ext_stats_lock;
+ unsigned long s_ext_blocks;
+ unsigned long s_ext_extents;
+#endif
};
#endif /* _LINUX_EXT3_FS_SB */
diff -puN include/linux/ext3_jbd.h~ext3-extents include/linux/ext3_jbd.h
--- linux-2.6.17/include/linux/ext3_jbd.h~ext3-extents 2006-06-28 13:25:19.673986076 -0700
+++ linux-2.6.17-ming/include/linux/ext3_jbd.h 2006-06-28 13:39:09.692748127 -0700
@@ -26,9 +26,14 @@
*
* We may have to touch one inode, one bitmap buffer, up to three
* indirection blocks, the group and superblock summaries, and the data
- * block to complete the transaction. */
-
-#define EXT3_SINGLEDATA_TRANS_BLOCKS 8U
+ * block to complete the transaction.
+ *
+ * For extents-enabled fs we may have to allocate and modify up to
+ * 5 levels of tree + root which is stored in inode. */
+
+#define EXT3_SINGLEDATA_TRANS_BLOCKS(sb) \
+ (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_EXTENTS) \
+ || test_opt(sb, EXTENTS) ? 27U : 8U)
/* Extended attribute operations touch at most two data buffers,
* two bitmap buffers, and two group summaries, in addition to the inode
@@ -42,7 +47,7 @@
* superblock only gets updated once, of course, so don't bother
* counting that again for the quota updates. */
-#define EXT3_DATA_TRANS_BLOCKS(sb) (EXT3_SINGLEDATA_TRANS_BLOCKS + \
+#define EXT3_DATA_TRANS_BLOCKS(sb) (EXT3_SINGLEDATA_TRANS_BLOCKS(sb) + \
EXT3_XATTR_TRANS_BLOCKS - 2 + \
2*EXT3_QUOTA_TRANS_BLOCKS(sb))
@@ -78,9 +83,9 @@
/* Amount of blocks needed for quota insert/delete - we do some block writes
* but inode, sb and group updates are done only once */
#define EXT3_QUOTA_INIT_BLOCKS(sb) (test_opt(sb, QUOTA) ? (DQUOT_INIT_ALLOC*\
- (EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3+DQUOT_INIT_REWRITE) : 0)
+ (EXT3_SINGLEDATA_TRANS_BLOCKS(sb)-3)+3+DQUOT_INIT_REWRITE) : 0)
#define EXT3_QUOTA_DEL_BLOCKS(sb) (test_opt(sb, QUOTA) ? (DQUOT_DEL_ALLOC*\
- (EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3+DQUOT_DEL_REWRITE) : 0)
+ (EXT3_SINGLEDATA_TRANS_BLOCKS(sb)-3)+3+DQUOT_DEL_REWRITE) : 0)
#else
#define EXT3_QUOTA_TRANS_BLOCKS(sb) 0
#define EXT3_QUOTA_INIT_BLOCKS(sb) 0
_
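For orientation, the on-disk layout defined in ext3_fs_extents.h above can be checked from userspace. The sketch below is not part of the patch: every demo_* name is hypothetical, fixed-width integers stand in for __le16/__le32, and byte order is ignored. It only illustrates that the header, extent and index records are 12 bytes each, so the 60-byte i_block has room for the header plus 4 extents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirrors of the on-disk records (assumption: no endian handling). */
struct demo_extent {
	uint32_t ee_block;	/* first logical block the extent covers */
	uint16_t ee_len;	/* number of blocks covered */
	uint16_t ee_start_hi;	/* high 16 bits of physical block */
	uint32_t ee_start;	/* low 32 bits of physical block */
};

struct demo_extent_idx {
	uint32_t ei_block;	/* index covers logical blocks from here */
	uint32_t ei_leaf;	/* low 32 bits of next-level block */
	uint16_t ei_leaf_hi;	/* high 16 bits of next-level block */
	uint16_t ei_unused;
};

struct demo_extent_header {
	uint16_t eh_magic;
	uint16_t eh_entries;
	uint16_t eh_max;
	uint16_t eh_depth;
	uint32_t eh_generation;
};

/*
 * Hypothetical helper: how many extent records fit behind the header
 * in an area of 'space' bytes (60 for i_block, or one fs block).
 */
static inline size_t demo_ext_capacity(size_t space)
{
	assert(space >= sizeof(struct demo_extent_header));
	return (space - sizeof(struct demo_extent_header)) /
		sizeof(struct demo_extent);
}
```

With these definitions demo_ext_capacity(60) gives the in-inode capacity and demo_ext_capacity(4096) the capacity of a 4k leaf block; index blocks have the same capacity because ext3_extent_idx is the same size as ext3_extent.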
* [RFC][Update][Patch 2/16]sector_t type format string
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (4 preceding siblings ...)
2006-06-30 0:16 ` [RFC][Update][Patch 1/16]core extent map support Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel Mingming Cao
` (13 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Define SECTOR_FMT so that sector_t values can be printed with the correct
format specifier on each architecture.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
---
linux-2.6.17-ming/include/asm-h8300/types.h | 1 +
linux-2.6.17-ming/include/asm-i386/types.h | 1 +
linux-2.6.17-ming/include/asm-mips/types.h | 5 +++++
linux-2.6.17-ming/include/asm-powerpc/types.h | 5 +++++
linux-2.6.17-ming/include/asm-s390/types.h | 5 +++++
linux-2.6.17-ming/include/asm-sh/types.h | 1 +
linux-2.6.17-ming/include/asm-x86_64/types.h | 1 +
linux-2.6.17-ming/include/linux/types.h | 1 +
8 files changed, 20 insertions(+)
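
A rough illustration of how callers are expected to use the new macro, written as a userspace sketch rather than kernel code. The demo_* names are hypothetical, and the typedef/format pair below simply assumes an architecture where sector_t is unsigned long long. The point is that the format string is pasted in by string-literal concatenation, so the call site does not change when the width of sector_t does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed stand-ins for one architecture's sector_t and SECTOR_FMT. */
typedef unsigned long long demo_sector_t;
#define DEMO_SECTOR_FMT "%llu"

/*
 * Format a message containing a block number. Adjacent string
 * literals are concatenated by the compiler, so the specifier
 * always matches the typedef above.
 */
static inline int demo_format_block(char *buf, size_t len, demo_sector_t blk)
{
	assert(buf != NULL);
	return snprintf(buf, len,
			"block " DEMO_SECTOR_FMT " out of range", blk);
}
```

On a configuration where sector_t is unsigned long, only the typedef and the macro would change to "%lu"; demo_format_block() itself stays as it is.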
diff -puN include/asm-h8300/types.h~sector_fmt include/asm-h8300/types.h
--- linux-2.6.17/include/asm-h8300/types.h~sector_fmt 2006-06-28 16:46:28.523183099 -0700
+++ linux-2.6.17-ming/include/asm-h8300/types.h 2006-06-28 16:46:28.552179772 -0700
@@ -57,6 +57,7 @@ typedef u32 dma_addr_t;
#define HAVE_SECTOR_T
typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
#define HAVE_BLKCNT_T
typedef u64 blkcnt_t;
diff -puN include/asm-i386/types.h~sector_fmt include/asm-i386/types.h
--- linux-2.6.17/include/asm-i386/types.h~sector_fmt 2006-06-28 16:46:28.526182755 -0700
+++ linux-2.6.17-ming/include/asm-i386/types.h 2006-06-28 16:46:28.553179658 -0700
@@ -59,6 +59,7 @@ typedef u64 dma64_addr_t;
#ifdef CONFIG_LBD
typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
#define HAVE_SECTOR_T
#endif
diff -puN include/asm-mips/types.h~sector_fmt include/asm-mips/types.h
--- linux-2.6.17/include/asm-mips/types.h~sector_fmt 2006-06-28 16:46:28.530182296 -0700
+++ linux-2.6.17-ming/include/asm-mips/types.h 2006-06-28 16:46:28.554179543 -0700
@@ -95,6 +95,11 @@ typedef unsigned long phys_t;
#ifdef CONFIG_LBD
typedef u64 sector_t;
+#if (_MIPS_SZLONG == 64)
+#define SECTOR_FMT "%lu"
+#else
+#define SECTOR_FMT "%llu"
+#endif
#define HAVE_SECTOR_T
#endif
diff -puN include/asm-powerpc/types.h~sector_fmt include/asm-powerpc/types.h
--- linux-2.6.17/include/asm-powerpc/types.h~sector_fmt 2006-06-28 16:46:28.534181837 -0700
+++ linux-2.6.17-ming/include/asm-powerpc/types.h 2006-06-28 16:46:28.554179543 -0700
@@ -99,6 +99,11 @@ typedef struct {
#ifdef CONFIG_LBD
typedef u64 sector_t;
+#ifdef __powerpc64__
+#define SECTOR_FMT "%lu"
+#else
+#define SECTOR_FMT "%llu"
+#endif
#define HAVE_SECTOR_T
#endif
diff -puN include/asm-s390/types.h~sector_fmt include/asm-s390/types.h
--- linux-2.6.17/include/asm-s390/types.h~sector_fmt 2006-06-28 16:46:28.537181493 -0700
+++ linux-2.6.17-ming/include/asm-s390/types.h 2006-06-28 16:46:28.555179428 -0700
@@ -89,6 +89,11 @@ typedef union {
#ifdef CONFIG_LBD
typedef u64 sector_t;
+#ifndef __s390x__
+#define SECTOR_FMT "%llu"
+#else
+#define SECTOR_FMT "%lu"
+#endif
#define HAVE_SECTOR_T
#endif
diff -puN include/asm-sh/types.h~sector_fmt include/asm-sh/types.h
--- linux-2.6.17/include/asm-sh/types.h~sector_fmt 2006-06-28 16:46:28.540181149 -0700
+++ linux-2.6.17-ming/include/asm-sh/types.h 2006-06-28 16:46:28.555179428 -0700
@@ -54,6 +54,7 @@ typedef u32 dma_addr_t;
#ifdef CONFIG_LBD
typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
#define HAVE_SECTOR_T
#endif
diff -puN include/asm-x86_64/types.h~sector_fmt include/asm-x86_64/types.h
--- linux-2.6.17/include/asm-x86_64/types.h~sector_fmt 2006-06-28 16:46:28.543180805 -0700
+++ linux-2.6.17-ming/include/asm-x86_64/types.h 2006-06-28 16:46:28.556179313 -0700
@@ -49,6 +49,7 @@ typedef u64 dma64_addr_t;
typedef u64 dma_addr_t;
typedef u64 sector_t;
+#define SECTOR_FMT "%llu"
#define HAVE_SECTOR_T
#endif /* __ASSEMBLY__ */
diff -puN include/linux/types.h~sector_fmt include/linux/types.h
--- linux-2.6.17/include/linux/types.h~sector_fmt 2006-06-28 16:46:28.549180116 -0700
+++ linux-2.6.17-ming/include/linux/types.h 2006-06-28 16:46:28.557179199 -0700
@@ -134,6 +134,7 @@ typedef __s64 int64_t;
*/
#ifndef HAVE_SECTOR_T
typedef unsigned long sector_t;
+#define SECTOR_FMT "%lu"
#endif
#ifndef HAVE_BLKCNT_T
_
* [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (5 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 2/16]sector_t type format string Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 4/16]support 48 bit blk number in extents Mingming Cao
` (12 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Redefine the ext3 in-kernel filesystem block type (ext3_fsblk_t) from unsigned
long to sector_t, to allow the kernel to handle >32 bit ext3 block numbers.
Signed-Off-By: Mingming Cao <cmm@us.ibm.com>
---
linux-2.6.17-ming/fs/ext3/balloc.c | 22 ++++++++--------------
linux-2.6.17-ming/fs/ext3/ialloc.c | 11 +++++++----
linux-2.6.17-ming/fs/ext3/resize.c | 14 ++++++--------
linux-2.6.17-ming/fs/ext3/super.c | 8 ++++----
linux-2.6.17-ming/include/linux/ext3_fs.h | 26 ++++++++++++++++++++++++++
linux-2.6.17-ming/include/linux/ext3_fs_i.h | 4 ++--
6 files changed, 53 insertions(+), 32 deletions(-)
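
Most of the churn in this patch comes from replacing open-coded '/' and '%' on block numbers with sector_div(), since a plain 64-bit division is not available to kernel code on 32-bit architectures. The userspace sketch below mimics that pattern; it is not the patch's code. demo_sector_div() is a hypothetical stand-in that follows the kernel convention (quotient stored back in place, remainder returned), and the superblock fields are passed in as plain parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for sector_div(): *n becomes the quotient, remainder is returned. */
static inline uint32_t demo_sector_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/*
 * Split a filesystem-wide block number into block group number and
 * offset within the group. Either output pointer may be NULL.
 */
static inline void demo_group_no_and_offset(uint64_t blocknr,
					    uint32_t first_data_block,
					    uint32_t blocks_per_group,
					    unsigned long *groupp,
					    int *offsetp)
{
	int offset;

	assert(blocks_per_group != 0 && blocknr >= first_data_block);
	blocknr -= first_data_block;
	offset = (int)demo_sector_div(&blocknr, blocks_per_group);
	if (offsetp)
		*offsetp = offset;
	if (groupp)
		*groupp = (unsigned long)blocknr;
}
```

For example, with a first data block of 1 and 8192 blocks per group, block 20000 falls in group 2 at offset 3615. Callers that only need one of the two values pass NULL for the other, as block_in_use() does in the balloc.c hunk.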
diff -puN fs/ext3/balloc.c~ext3_fsblk_sector_t fs/ext3/balloc.c
--- linux-2.6.17/fs/ext3/balloc.c~ext3_fsblk_sector_t 2006-06-28 16:46:36.057318618 -0700
+++ linux-2.6.17-ming/fs/ext3/balloc.c 2006-06-28 16:46:36.082315750 -0700
@@ -38,7 +38,6 @@
#define in_range(b, first, len) ((b) >= (first) && (b) <= (first) + (len) - 1)
-
struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
unsigned int block_group,
struct buffer_head ** bh)
@@ -340,10 +339,7 @@ void ext3_free_blocks_sb(handle_t *handl
do_more:
overflow = 0;
- block_group = (block - le32_to_cpu(es->s_first_data_block)) /
- EXT3_BLOCKS_PER_GROUP(sb);
- bit = (block - le32_to_cpu(es->s_first_data_block)) %
- EXT3_BLOCKS_PER_GROUP(sb);
+ ext3_get_group_no_and_offset(sb, block, &block_group, &bit);
/*
* Check to see if we are freeing blocks across a group
* boundary.
@@ -1205,7 +1201,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
- int group_no;
+ unsigned long group_no;
int goal_group;
ext3_grpblk_t grp_target_blk; /* blockgroup relative goal block */
ext3_grpblk_t grp_alloc_blk; /* blockgroup-relative allocated block*/
@@ -1268,8 +1264,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
if (goal < le32_to_cpu(es->s_first_data_block) ||
goal >= le32_to_cpu(es->s_blocks_count))
goal = le32_to_cpu(es->s_first_data_block);
- group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
- EXT3_BLOCKS_PER_GROUP(sb);
+ ext3_get_group_no_and_offset(sb, goal, &group_no, &grp_target_blk);
gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
if (!gdp)
goto io_error;
@@ -1286,8 +1281,6 @@ retry:
my_rsv = NULL;
if (free_blocks > 0) {
- grp_target_blk = ((goal - le32_to_cpu(es->s_first_data_block)) %
- EXT3_BLOCKS_PER_GROUP(sb));
bitmap_bh = read_block_bitmap(sb, group_no);
if (!bitmap_bh)
goto io_error;
@@ -1414,7 +1407,7 @@ allocated:
if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
ext3_error(sb, "ext3_new_block",
"block("E3FSBLK") >= blocks count(%d) - "
- "block_group = %d, es == %p ", ret_block,
+ "block_group = %lu, es == %p ", ret_block,
le32_to_cpu(es->s_blocks_count), group_no, es);
goto out;
}
@@ -1528,9 +1521,10 @@ ext3_fsblk_t ext3_count_free_blocks(stru
static inline int
block_in_use(ext3_fsblk_t block, struct super_block *sb, unsigned char *map)
{
- return ext3_test_bit ((block -
- le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block)) %
- EXT3_BLOCKS_PER_GROUP(sb), map);
+ ext3_grpblk_t offset;
+
+ ext3_get_group_no_and_offset(sb, block, NULL, &offset);
+ return ext3_test_bit (offset, map);
}
static inline int test_root(int a, int b)
diff -puN fs/ext3/ialloc.c~ext3_fsblk_sector_t fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~ext3_fsblk_sector_t 2006-06-28 16:46:36.060318274 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c 2006-06-28 16:46:36.084315520 -0700
@@ -23,7 +23,7 @@
#include <linux/buffer_head.h>
#include <linux/random.h>
#include <linux/bitops.h>
-
+#include <linux/blkdev.h>
#include <asm/byteorder.h>
#include "xattr.h"
@@ -274,7 +274,8 @@ static int find_group_orlov(struct super
freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- avefreeb = freeb / ngroups;
+ avefreeb = freeb;
+ sector_div(avefreeb, ngroups);
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
if ((parent == sb->s_root->d_inode) ||
@@ -303,13 +304,15 @@ static int find_group_orlov(struct super
goto fallback;
}
- blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - freeb) / ndirs;
+ blocks_per_dir = le32_to_cpu(es->s_blocks_count) - freeb;
+ sector_div(blocks_per_dir, ndirs);
max_dirs = ndirs / ngroups + inodes_per_group / 16;
min_inodes = avefreei - inodes_per_group / 4;
min_blocks = avefreeb - EXT3_BLOCKS_PER_GROUP(sb) / 4;
- max_debt = EXT3_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, (ext3_fsblk_t)BLOCK_COST);
+ max_debt = EXT3_BLOCKS_PER_GROUP(sb);
+ sector_div(max_debt, max(blocks_per_dir, (ext3_fsblk_t)BLOCK_COST));
if (max_debt * INODE_COST > inodes_per_group)
max_debt = inodes_per_group / INODE_COST;
if (max_debt > 255)
diff -puN fs/ext3/resize.c~ext3_fsblk_sector_t fs/ext3/resize.c
--- linux-2.6.17/fs/ext3/resize.c~ext3_fsblk_sector_t 2006-06-28 16:46:36.065317700 -0700
+++ linux-2.6.17-ming/fs/ext3/resize.c 2006-06-28 16:46:36.086315291 -0700
@@ -15,7 +15,6 @@
#include <linux/sched.h>
#include <linux/smp_lock.h>
#include <linux/ext3_jbd.h>
-
#include <linux/errno.h>
#include <linux/slab.h>
@@ -37,7 +36,7 @@ static int verify_group_input(struct sup
le16_to_cpu(es->s_reserved_gdt_blocks)) : 0;
ext3_fsblk_t metaend = start + overhead;
struct buffer_head *bh = NULL;
- ext3_grpblk_t free_blocks_count;
+ ext3_grpblk_t free_blocks_count, offset;
int err = -EINVAL;
input->free_blocks_count = free_blocks_count =
@@ -50,13 +49,13 @@ static int verify_group_input(struct sup
"no-super", input->group, input->blocks_count,
free_blocks_count, input->reserved_blocks);
+ ext3_get_group_no_and_offset(sb, start, NULL, &offset);
if (group != sbi->s_groups_count)
ext3_warning(sb, __FUNCTION__,
"Cannot add at group %u (only %lu groups)",
input->group, sbi->s_groups_count);
- else if ((start - le32_to_cpu(es->s_first_data_block)) %
- EXT3_BLOCKS_PER_GROUP(sb))
- ext3_warning(sb, __FUNCTION__, "Last group not full");
+ else if (offset != 0)
+ ext3_warning(sb, __FUNCTION__, "Last group not full");
else if (input->reserved_blocks > input->blocks_count / 5)
ext3_warning(sb, __FUNCTION__, "Reserved blocks too high (%u)",
input->reserved_blocks);
@@ -933,7 +932,7 @@ int ext3_group_extend(struct super_block
if (n_blocks_count > (sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) {
printk(KERN_ERR "EXT3-fs: filesystem on %s:"
- " too large to resize to %lu blocks safely\n",
+ " too large to resize to "E3FSBLK" blocks safely\n",
sb->s_id, n_blocks_count);
if (sizeof(sector_t) < 8)
ext3_warning(sb, __FUNCTION__,
@@ -948,8 +947,7 @@ int ext3_group_extend(struct super_block
}
/* Handle the remaining blocks in the last group only. */
- last = (o_blocks_count - le32_to_cpu(es->s_first_data_block)) %
- EXT3_BLOCKS_PER_GROUP(sb);
+ ext3_get_group_no_and_offset(sb, o_blocks_count, NULL, &last);
if (last == 0) {
ext3_warning(sb, __FUNCTION__,
diff -puN fs/ext3/super.c~ext3_fsblk_sector_t fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~ext3_fsblk_sector_t 2006-06-28 16:46:36.069317241 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c 2006-06-28 16:46:36.090314832 -0700
@@ -1388,8 +1388,8 @@ static int ext3_fill_super (struct super
* block sizes. We need to calculate the offset from buffer start.
*/
if (blocksize != EXT3_MIN_BLOCK_SIZE) {
- logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
- offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
+ logic_sb_block = sb_block * EXT3_MIN_BLOCK_SIZE;
+ offset = sector_div(logic_sb_block, blocksize);
} else {
logic_sb_block = sb_block;
}
@@ -1494,8 +1494,8 @@ static int ext3_fill_super (struct super
brelse (bh);
sb_set_blocksize(sb, blocksize);
- logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
- offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
+ logic_sb_block = sb_block * EXT3_MIN_BLOCK_SIZE;
+ offset = sector_div(logic_sb_block, blocksize);
bh = sb_bread(sb, logic_sb_block);
if (!bh) {
printk(KERN_ERR
diff -puN include/linux/ext3_fs.h~ext3_fsblk_sector_t include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3_fsblk_sector_t 2006-06-28 16:46:36.073316783 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 16:46:36.092314603 -0700
@@ -17,6 +17,7 @@
#define _LINUX_EXT3_FS_H
#include <linux/types.h>
+#include <linux/blkdev.h>
/*
* The second extended filesystem constants/structures
@@ -728,6 +729,27 @@ ext3_group_first_block_no(struct super_b
#define ERR_BAD_DX_DIR -75000
/*
+ * This function calculates the block group number and offset,
+ * given a block number
+ */
+
+static inline void ext3_get_group_no_and_offset(struct super_block * sb,
+ ext3_fsblk_t blocknr, unsigned long* blockgrpp,
+ ext3_grpblk_t *offsetp)
+{
+ struct ext3_super_block *es = EXT3_SB(sb)->s_es;
+ ext3_grpblk_t offset;
+
+ blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
+ offset = sector_div(blocknr, EXT3_BLOCKS_PER_GROUP(sb));
+ if (offsetp)
+ *offsetp = offset;
+ if (blockgrpp)
+ *blockgrpp = blocknr;
+
+}
+
+/*
* Function prototypes
*/
@@ -740,6 +762,10 @@ ext3_group_first_block_no(struct super_b
# define NORET_AND noreturn,
/* balloc.c */
+extern unsigned int ext3_block_group(struct super_block *sb,
+ ext3_fsblk_t blocknr);
+extern ext3_grpblk_t ext3_block_group_offset(struct super_block *sb,
+ ext3_fsblk_t blocknr);
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
diff -puN include/linux/ext3_fs_i.h~ext3_fsblk_sector_t include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3_fsblk_sector_t 2006-06-28 16:46:36.077316324 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h 2006-06-28 16:46:36.093314488 -0700
@@ -25,9 +25,9 @@
typedef int ext3_grpblk_t;
/* data type for filesystem-wide blocks number */
-typedef unsigned long ext3_fsblk_t;
+typedef sector_t ext3_fsblk_t;
-#define E3FSBLK "%lu"
+#define E3FSBLK SECTOR_FMT
struct ext3_reserve_window {
ext3_fsblk_t _rsv_start; /* First byte reserved */
_
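
For readers following the arithmetic in ext3_get_group_no_and_offset(): in the kernel, sector_div(n, base) divides n in place and returns the remainder, which is why the helper stores blocknr into *blockgrpp after the call. Below is a minimal userspace sketch of the same calculation. The names demo_geometry and demo_group_no_and_offset are hypothetical and not part of the patch, and plain 64-bit division stands in for sector_div().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two superblock values the helper reads. */
struct demo_geometry {
	uint32_t first_data_block;	/* role of s_first_data_block */
	uint32_t blocks_per_group;	/* role of EXT3_BLOCKS_PER_GROUP(sb) */
};

/*
 * Split a filesystem-wide block number into (group, offset within group).
 * Either output pointer may be NULL, as in the patch.
 */
static void demo_group_no_and_offset(const struct demo_geometry *g,
				     uint64_t blocknr,
				     uint64_t *groupp, uint32_t *offsetp)
{
	assert(g->blocks_per_group != 0);
	assert(blocknr >= g->first_data_block);

	/* Block groups are counted from the first data block. */
	blocknr -= g->first_data_block;
	if (offsetp)
		*offsetp = (uint32_t)(blocknr % g->blocks_per_group);
	if (groupp)
		*groupp = blocknr / g->blocks_per_group;
}
```

With illustrative values first_data_block = 1 and blocks_per_group = 8192, block 8198 falls in group 1 at offset 5.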
* [RFC][Update][Patch 4/16]support 48 bit blk number in extents
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (6 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 3/16]convert ext3_fsblk_t to sector_t to support >32 bit block in kernel Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 5/16]block type convert " Mingming Cao
` (11 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Add 48-bit physical block number support to extents: the low 32 bits
stay in ee_start/ei_leaf and the high 16 bits go in
ee_start_hi/ei_leaf_hi.
Signed-Off-By: Alex Tomas <alex@clusterfs.com>
---
linux-2.6.17-ming/fs/ext3/extents.c | 138 +++++++++++++---------
linux-2.6.17-ming/include/linux/ext3_fs_extents.h | 2
linux-2.6.17-ming/include/linux/ext3_fs_i.h | 2
3 files changed, 87 insertions(+), 55 deletions(-)
diff -puN fs/ext3/extents.c~ext3-extents-48bit fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-extents-48bit 2006-06-28 16:46:39.848883567 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c 2006-06-28 16:46:39.863881846 -0700
@@ -44,6 +44,44 @@
#include <asm/uaccess.h>
+/* this macro combines low and hi parts of phys. blocknr into sector_t */
+static inline sector_t ext_pblock(struct ext3_extent *ex)
+{
+ sector_t block;
+
+ block = le32_to_cpu(ex->ee_start);
+ if (sizeof(sector_t) > 4)
+ block |= ((sector_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
+ return block;
+}
+
+/* this macro combines low and hi parts of phys. blocknr into sector_t */
+static inline sector_t idx_pblock(struct ext3_extent_idx *ix)
+{
+ sector_t block;
+
+ block = le32_to_cpu(ix->ei_leaf);
+ if (sizeof(sector_t) > 4)
+ block |= ((sector_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+ return block;
+}
+
+/* the routine stores large phys. blocknr into extent breaking it into parts */
+static inline void ext3_ext_store_pblock(struct ext3_extent *ex, sector_t pb)
+{
+ ex->ee_start = cpu_to_le32((unsigned long) (pb & 0xffffffff));
+ if (sizeof(sector_t) > 4)
+ ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
+}
+
+/* the routine stores large phys. blocknr into index breaking it into parts */
+static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, sector_t pb)
+{
+ ix->ei_leaf = cpu_to_le32((unsigned long) (pb & 0xffffffff));
+ if (sizeof(sector_t) > 4)
+ ix->ei_leaf_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
+}
+
static int ext3_ext_check_header(const char *function, struct inode *inode,
struct ext3_extent_header *eh)
{
@@ -126,7 +164,7 @@ static int ext3_ext_dirty(handle_t *hand
static int ext3_ext_find_goal(struct inode *inode,
struct ext3_ext_path *path,
- unsigned long block)
+ sector_t block)
{
struct ext3_inode_info *ei = EXT3_I(inode);
unsigned long bg_start;
@@ -139,8 +177,7 @@ static int ext3_ext_find_goal(struct ino
/* try to predict block placement */
if ((ex = path[depth].p_ext))
- return le32_to_cpu(ex->ee_start)
- + (block - le32_to_cpu(ex->ee_block));
+ return ext_pblock(ex)+(block-le32_to_cpu(ex->ee_block));
/* it looks index is empty
* try to find starting from index itself */
@@ -230,13 +267,13 @@ static void ext3_ext_show_path(struct in
ext_debug("path:");
for (k = 0; k <= l; k++, path++) {
if (path->p_idx) {
- ext_debug(" %d->%d", le32_to_cpu(path->p_idx->ei_block),
- le32_to_cpu(path->p_idx->ei_leaf));
+ ext_debug(" %d->%llu", le32_to_cpu(path->p_idx->ei_block),
+ idx_pblock(path->p_idx));
} else if (path->p_ext) {
- ext_debug(" %d:%d:%d",
+ ext_debug(" %d:%d:%lld",
le32_to_cpu(path->p_ext->ee_block),
le16_to_cpu(path->p_ext->ee_len),
- le32_to_cpu(path->p_ext->ee_start));
+ ext_pblock(path->p_ext));
} else
ext_debug(" []");
}
@@ -257,9 +294,8 @@ static void ext3_ext_show_leaf(struct in
ex = EXT_FIRST_EXTENT(eh);
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
- ext_debug("%d:%d:%d ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len),
- le32_to_cpu(ex->ee_start));
+ ext_debug("%d:%d:%lld ", le32_to_cpu(ex->ee_block),
+ le16_to_cpu(ex->ee_len), ext_pblock(ex));
}
ext_debug("\n");
}
@@ -308,8 +344,8 @@ ext3_ext_binsearch_idx(struct inode *ino
}
path->p_idx = l - 1;
- ext_debug(" -> %d->%d ", le32_to_cpu(path->p_idx->ei_block),
- le32_to_cpu(path->p_idx->ei_leaf));
+ ext_debug(" -> %d->%lld ", le32_to_cpu(path->p_idx->ei_block),
+ idx_pblock(path->p_idx));
#ifdef CHECK_BINSEARCH
{
@@ -374,10 +410,10 @@ ext3_ext_binsearch(struct inode *inode,
}
path->p_ext = l - 1;
- ext_debug(" -> %d:%d:%d ",
+ ext_debug(" -> %d:%lld:%d ",
le32_to_cpu(path->p_ext->ee_block),
- le32_to_cpu(path->p_ext->ee_start),
- le16_to_cpu(path->p_ext->ee_len));
+ ext_pblock(path->p_ext),
+ le16_to_cpu(path->p_ext->ee_len));
#ifdef CHECK_BINSEARCH
{
@@ -442,7 +478,7 @@ ext3_ext_find_extent(struct inode *inode
ext_debug("depth %d: num %d, max %d\n",
ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
ext3_ext_binsearch_idx(inode, path + ppos, block);
- path[ppos].p_block = le32_to_cpu(path[ppos].p_idx->ei_leaf);
+ path[ppos].p_block = idx_pblock(path[ppos].p_idx);
path[ppos].p_depth = i;
path[ppos].p_ext = NULL;
@@ -524,7 +560,7 @@ static int ext3_ext_insert_index(handle_
}
ix->ei_block = cpu_to_le32(logical);
- ix->ei_leaf = cpu_to_le32(ptr);
+ ext3_idx_store_pblock(ix, ptr);
curp->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(curp->p_hdr->eh_entries)+1);
BUG_ON(le16_to_cpu(curp->p_hdr->eh_entries)
@@ -633,9 +669,9 @@ static int ext3_ext_split(handle_t *hand
path[depth].p_ext++;
while (path[depth].p_ext <=
EXT_MAX_EXTENT(path[depth].p_hdr)) {
- ext_debug("move %d:%d:%d in new leaf %lu\n",
+ ext_debug("move %d:%lld:%d in new leaf %lu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
- le32_to_cpu(path[depth].p_ext->ee_start),
+ ext_pblock(path[depth].p_ext),
le16_to_cpu(path[depth].p_ext->ee_len),
newblock);
/*memmove(ex++, path[depth].p_ext++,
@@ -696,7 +732,7 @@ static int ext3_ext_split(handle_t *hand
neh->eh_depth = cpu_to_le16(depth - i);
fidx = EXT_FIRST_INDEX(neh);
fidx->ei_block = border;
- fidx->ei_leaf = cpu_to_le32(oldblock);
+ ext3_idx_store_pblock(fidx, oldblock);
ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
newblock, (unsigned long) le32_to_cpu(border),
@@ -710,9 +746,9 @@ static int ext3_ext_split(handle_t *hand
BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
EXT_LAST_INDEX(path[i].p_hdr));
while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
- ext_debug("%d: move %d:%d in new index %lu\n", i,
+ ext_debug("%d: move %d:%d in new index %llu\n", i,
le32_to_cpu(path[i].p_idx->ei_block),
- le32_to_cpu(path[i].p_idx->ei_leaf),
+ idx_pblock(path[i].p_idx),
newblock);
/*memmove(++fidx, path[i].p_idx++,
sizeof(struct ext3_extent_idx));
@@ -839,13 +875,13 @@ static int ext3_ext_grow_indepth(handle_
curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
/* FIXME: it works, but actually path[0] can be index */
curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
- curp->p_idx->ei_leaf = cpu_to_le32(newblock);
+ ext3_idx_store_pblock(curp->p_idx, newblock);
neh = ext_inode_hdr(inode);
fidx = EXT_FIRST_INDEX(neh);
- ext_debug("new root: num %d(%d), lblock %d, ptr %d\n",
+ ext_debug("new root: num %d(%d), lblock %d, ptr %llu\n",
le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
- le32_to_cpu(fidx->ei_block), le32_to_cpu(fidx->ei_leaf));
+ le32_to_cpu(fidx->ei_block), idx_pblock(fidx));
neh->eh_depth = cpu_to_le16(path->p_depth + 1);
err = ext3_ext_dirty(handle, inode, curp);
@@ -1042,7 +1078,6 @@ static int inline
ext3_can_extents_be_merged(struct inode *inode, struct ext3_extent *ex1,
struct ext3_extent *ex2)
{
- /* FIXME: 48bit support */
if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len)
!= le32_to_cpu(ex2->ee_block))
return 0;
@@ -1052,8 +1087,7 @@ ext3_can_extents_be_merged(struct inode
return 0;
#endif
- if (le32_to_cpu(ex1->ee_start) + le16_to_cpu(ex1->ee_len)
- == le32_to_cpu(ex2->ee_start))
+ if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
return 1;
return 0;
}
@@ -1080,11 +1114,10 @@ int ext3_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
- ext_debug("append %d block to %d:%d (from %d)\n",
+ ext_debug("append %d block to %d:%d (from %lld)\n",
le16_to_cpu(newext->ee_len),
le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len),
- le32_to_cpu(ex->ee_start));
+ le16_to_cpu(ex->ee_len), ext_pblock(ex));
if ((err = ext3_ext_get_access(handle, inode, path + depth)))
return err;
ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
@@ -1140,9 +1173,9 @@ has_space:
if (!nearex) {
/* there is no extent in this leaf, create first one */
- ext_debug("first extent in the leaf: %d:%d:%d\n",
+ ext_debug("first extent in the leaf: %d:%lld:%d\n",
le32_to_cpu(newext->ee_block),
- le32_to_cpu(newext->ee_start),
+ ext_pblock(newext),
le16_to_cpu(newext->ee_len));
path[depth].p_ext = EXT_FIRST_EXTENT(eh);
} else if (le32_to_cpu(newext->ee_block)
@@ -1152,10 +1185,10 @@ has_space:
len = EXT_MAX_EXTENT(eh) - nearex;
len = (len - 1) * sizeof(struct ext3_extent);
len = len < 0 ? 0 : len;
- ext_debug("insert %d:%d:%d after: nearest 0x%p, "
+ ext_debug("insert %d:%lld:%d after: nearest 0x%p, "
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
- le32_to_cpu(newext->ee_start),
+ ext_pblock(newext),
le16_to_cpu(newext->ee_len),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 2, nearex + 1, len);
@@ -1165,10 +1198,10 @@ has_space:
BUG_ON(newext->ee_block == nearex->ee_block);
len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
len = len < 0 ? 0 : len;
- ext_debug("insert %d:%d:%d before: nearest 0x%p, "
+ ext_debug("insert %d:%lld:%d before: nearest 0x%p, "
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
- le32_to_cpu(newext->ee_start),
+ ext_pblock(newext),
le16_to_cpu(newext->ee_len),
nearex, len, nearex + 1, nearex + 2);
memmove(nearex + 1, nearex, len);
@@ -1179,9 +1212,8 @@ has_space:
nearex = path[depth].p_ext;
nearex->ee_block = newext->ee_block;
nearex->ee_start = newext->ee_start;
+ nearex->ee_start_hi = newext->ee_start_hi;
nearex->ee_len = newext->ee_len;
- /* FIXME: support for large fs */
- nearex->ee_start_hi = 0;
merge:
/* try to merge extents to the right */
@@ -1290,7 +1322,7 @@ int ext3_ext_walk_space(struct inode *in
} else {
cbex.ec_block = le32_to_cpu(ex->ee_block);
cbex.ec_len = le16_to_cpu(ex->ee_len);
- cbex.ec_start = le32_to_cpu(ex->ee_start);
+ cbex.ec_start = ext_pblock(ex);
cbex.ec_type = EXT3_EXT_CACHE_EXTENT;
}
@@ -1398,7 +1430,7 @@ ext3_ext_in_cache(struct inode *inode, u
cex->ec_type != EXT3_EXT_CACHE_EXTENT);
if (block >= cex->ec_block && block < cex->ec_block + cex->ec_len) {
ex->ee_block = cpu_to_le32(cex->ec_block);
- ex->ee_start = cpu_to_le32(cex->ec_start);
+ ext3_ext_store_pblock(ex, cex->ec_start);
ex->ee_len = cpu_to_le16(cex->ec_len);
ext_debug("%lu cached by %lu:%lu:%lu\n",
(unsigned long) block,
@@ -1426,7 +1458,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
/* free index block */
path--;
- leaf = le32_to_cpu(path->p_idx->ei_leaf);
+ leaf = idx_pblock(path->p_idx);
BUG_ON(path->p_hdr->eh_entries == 0);
if ((err = ext3_ext_get_access(handle, inode, path)))
return err;
@@ -1517,7 +1549,7 @@ static int ext3_remove_blocks(handle_t *
/* tail removal */
unsigned long num, start;
num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
- start = le32_to_cpu(ex->ee_start) + le16_to_cpu(ex->ee_len) - num;
+ start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
ext_debug("free last %lu blocks starting %lu\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1621,7 +1653,7 @@ ext3_ext_rm_leaf(handle_t *handle, struc
if (num == 0) {
/* this extent is removed entirely mark slot unused */
- ex->ee_start = 0;
+ ext3_ext_store_pblock(ex, 0);
eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
}
@@ -1632,8 +1664,8 @@ ext3_ext_rm_leaf(handle_t *handle, struc
if (err)
goto out;
- ext_debug("new extent: %u:%u:%u\n", block, num,
- le32_to_cpu(ex->ee_start));
+ ext_debug("new extent: %u:%u:%llu\n", block, num,
+ ext_pblock(ex));
ex--;
ex_ee_block = le32_to_cpu(ex->ee_block);
ex_ee_len = le16_to_cpu(ex->ee_len);
@@ -1748,11 +1780,11 @@ int ext3_ext_remove_space(struct inode *
path[i].p_idx);
if (ext3_ext_more_to_rm(path + i)) {
/* go to the next level */
- ext_debug("move to level %d (block %d)\n",
- i + 1, le32_to_cpu(path[i].p_idx->ei_leaf));
+ ext_debug("move to level %d (block %llu)\n",
+ i + 1, idx_pblock(path[i].p_idx));
memset(path + i + 1, 0, sizeof(*path));
path[i+1].p_bh =
- sb_bread(sb, le32_to_cpu(path[i].p_idx->ei_leaf));
+ sb_bread(sb, idx_pblock(path[i].p_idx));
if (!path[i+1].p_bh) {
/* should we reset i_size? */
err = -EIO;
@@ -1878,7 +1910,7 @@ int ext3_ext_get_blocks(handle_t *handle
/* block is already allocated */
newblock = iblock
- le32_to_cpu(newex.ee_block)
- + le32_to_cpu(newex.ee_start);
+ + ext_pblock(&newex);
/* number of remain blocks in the extent */
allocated = le16_to_cpu(newex.ee_len) -
(iblock - le32_to_cpu(newex.ee_block));
@@ -1907,7 +1939,7 @@ int ext3_ext_get_blocks(handle_t *handle
if ((ex = path[depth].p_ext)) {
unsigned long ee_block = le32_to_cpu(ex->ee_block);
- unsigned long ee_start = le32_to_cpu(ex->ee_start);
+ unsigned long ee_start = ext_pblock(ex);
unsigned short ee_len = le16_to_cpu(ex->ee_len);
/* if found exent covers block, simple return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -1943,7 +1975,7 @@ int ext3_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
newex.ee_block = cpu_to_le32(iblock);
- newex.ee_start = cpu_to_le32(newblock);
+ ext3_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext3_ext_insert_extent(handle, inode, path, &newex);
if (err)
@@ -1953,7 +1985,7 @@ int ext3_ext_get_blocks(handle_t *handle
EXT3_I(inode)->i_disksize = inode->i_size;
/* previous routine could use block we allocated */
- newblock = le32_to_cpu(newex.ee_start);
+ newblock = ext_pblock(&newex);
__set_bit(BH_New, &bh_result->b_state);
ext3_ext_put_in_cache(inode, iblock, allocated, newblock,
diff -puN include/linux/ext3_fs_extents.h~ext3-extents-48bit include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-extents-48bit 2006-06-28 16:46:39.851883223 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h 2006-06-28 16:46:39.864881731 -0700
@@ -108,7 +108,7 @@ struct ext3_extent_header {
* truncate uses it to simulate recursive walking
*/
struct ext3_ext_path {
- __u32 p_block;
+ __u64 p_block;
__u16 p_depth;
struct ext3_extent *p_ext;
struct ext3_extent_idx *p_idx;
diff -puN include/linux/ext3_fs_i.h~ext3-extents-48bit include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents-48bit 2006-06-28 16:46:39.855882764 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h 2006-06-28 16:46:39.864881731 -0700
@@ -68,7 +68,7 @@ struct ext3_block_alloc_info {
* storage for cached extent
*/
struct ext3_ext_cache {
- __u32 ec_start;
+ sector_t ec_start;
__u32 ec_block;
__u32 ec_len; /* must be 32bit to return holes */
__u32 ec_type;
_
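
The split done by ext3_ext_store_pblock() and undone by ext_pblock() can be sketched in userspace as follows. This is only an illustration under stated assumptions: demo_extent48, demo_store_pblock, demo_load_pblock and DEMO_MAX_PBLOCK are hypothetical names, the fields are kept in host byte order (the patch converts them with cpu_to_le32/cpu_to_le16), and a fixed 64-bit type replaces sector_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory mirror of the two on-disk fields (host byte order). */
struct demo_extent48 {
	uint32_t start_lo;	/* role of ee_start */
	uint16_t start_hi;	/* role of ee_start_hi */
};

#define DEMO_MAX_PBLOCK ((UINT64_C(1) << 48) - 1)

/* Break a physical block number into a 32-bit low and a 16-bit high part. */
static void demo_store_pblock(struct demo_extent48 *ex, uint64_t pb)
{
	assert(pb <= DEMO_MAX_PBLOCK);
	ex->start_lo = (uint32_t)(pb & 0xffffffffu);
	ex->start_hi = (uint16_t)((pb >> 32) & 0xffff);
}

/* Recombine the two parts into a single 48-bit block number. */
static uint64_t demo_load_pblock(const struct demo_extent48 *ex)
{
	return ((uint64_t)ex->start_hi << 32) | ex->start_lo;
}
```

The patch writes the shift as "<< 31) << 1" rather than "<< 32", presumably so that the expression still compiles cleanly when sector_t is only 32 bits wide and the branch is dead code. The sketch uses a 64-bit type throughout, so a plain shift by 32 is safe here.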
* [RFC][Update][Patch 5/16]block type convert in extents
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (7 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 4/16]support 48 bit blk number in extents Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 6/16]handing unitialized extents Mingming Cao
` (10 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Convert the in-kernel filesystem block type used by the extents code
to ext3_fsblk_t.
Signed-Off-By: Avantika Mathur <mathur@us.ibm.com>
Acked-By: Alex Tomas <alex@us.ibm.com>
---
linux-2.6.17-ming/fs/ext3/extents.c | 106 +++++++++++-----------
linux-2.6.17-ming/include/linux/ext3_fs_extents.h | 2
linux-2.6.17-ming/include/linux/ext3_fs_i.h | 8 -
3 files changed, 59 insertions(+), 57 deletions(-)
diff -puN fs/ext3/extents.c~ext3-extents-ext3_fsblk_t fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-extents-ext3_fsblk_t 2006-06-28 16:46:45.589224909 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c 2006-06-28 16:46:45.603223303 -0700
@@ -44,41 +44,41 @@
#include <asm/uaccess.h>
-/* this macro combines low and hi parts of phys. blocknr into sector_t */
-static inline sector_t ext_pblock(struct ext3_extent *ex)
+/* this macro combines low and hi parts of phys. blocknr into ext3_fsblk_t */
+static inline ext3_fsblk_t ext_pblock(struct ext3_extent *ex)
{
- sector_t block;
+ ext3_fsblk_t block;
block = le32_to_cpu(ex->ee_start);
- if (sizeof(sector_t) > 4)
- block |= ((sector_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
+ if (sizeof(ext3_fsblk_t) > 4)
+ block |= ((ext3_fsblk_t) le16_to_cpu(ex->ee_start_hi) << 31) << 1;
return block;
}
-/* this macro combines low and hi parts of phys. blocknr into sector_t */
-static inline sector_t idx_pblock(struct ext3_extent_idx *ix)
+/* this macro combines low and hi parts of phys. blocknr into ext3_fsblk_t */
+static inline ext3_fsblk_t idx_pblock(struct ext3_extent_idx *ix)
{
- sector_t block;
+ ext3_fsblk_t block;
block = le32_to_cpu(ix->ei_leaf);
- if (sizeof(sector_t) > 4)
- block |= ((sector_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
+ if (sizeof(ext3_fsblk_t) > 4)
+ block |= ((ext3_fsblk_t) le16_to_cpu(ix->ei_leaf_hi) << 31) << 1;
return block;
}
/* the routine stores large phys. blocknr into extent breaking it into parts */
-static inline void ext3_ext_store_pblock(struct ext3_extent *ex, sector_t pb)
+static inline void ext3_ext_store_pblock(struct ext3_extent *ex, ext3_fsblk_t pb)
{
ex->ee_start = cpu_to_le32((unsigned long) (pb & 0xffffffff));
- if (sizeof(sector_t) > 4)
+ if (sizeof(ext3_fsblk_t) > 4)
ex->ee_start_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
}
/* the routine stores large phys. blocknr into index breaking it into parts */
-static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, sector_t pb)
+static inline void ext3_idx_store_pblock(struct ext3_extent_idx *ix, ext3_fsblk_t pb)
{
ix->ei_leaf = cpu_to_le32((unsigned long) (pb & 0xffffffff));
- if (sizeof(sector_t) > 4)
+ if (sizeof(ext3_fsblk_t) > 4)
ix->ei_leaf_hi = cpu_to_le16((unsigned long) ((pb >> 31) >> 1) & 0xffff);
}
@@ -162,13 +162,13 @@ static int ext3_ext_dirty(handle_t *hand
return err;
}
-static int ext3_ext_find_goal(struct inode *inode,
+static ext3_fsblk_t ext3_ext_find_goal(struct inode *inode,
struct ext3_ext_path *path,
- sector_t block)
+ ext3_fsblk_t block)
{
struct ext3_inode_info *ei = EXT3_I(inode);
- unsigned long bg_start;
- unsigned long colour;
+ ext3_fsblk_t bg_start;
+ ext3_grpblk_t colour;
int depth;
if (path) {
@@ -193,12 +193,12 @@ static int ext3_ext_find_goal(struct ino
return bg_start + colour + block;
}
-static int
+static ext3_fsblk_t
ext3_ext_new_block(handle_t *handle, struct inode *inode,
struct ext3_ext_path *path,
struct ext3_extent *ex, int *err)
{
- int goal, newblock;
+ ext3_fsblk_t goal, newblock;
goal = ext3_ext_find_goal(inode, path, le32_to_cpu(ex->ee_block));
newblock = ext3_new_block(handle, inode, goal, err);
@@ -267,10 +267,10 @@ static void ext3_ext_show_path(struct in
ext_debug("path:");
for (k = 0; k <= l; k++, path++) {
if (path->p_idx) {
- ext_debug(" %d->%llu", le32_to_cpu(path->p_idx->ei_block),
+ ext_debug(" %d->"E3FSBLK, le32_to_cpu(path->p_idx->ei_block),
idx_pblock(path->p_idx));
} else if (path->p_ext) {
- ext_debug(" %d:%d:%lld",
+ ext_debug(" %d:%d:"E3FSBLK" ",
le32_to_cpu(path->p_ext->ee_block),
le16_to_cpu(path->p_ext->ee_len),
ext_pblock(path->p_ext));
@@ -294,7 +294,7 @@ static void ext3_ext_show_leaf(struct in
ex = EXT_FIRST_EXTENT(eh);
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
- ext_debug("%d:%d:%lld ", le32_to_cpu(ex->ee_block),
+ ext_debug("%d:%d:"E3FSBLK" ", le32_to_cpu(ex->ee_block),
le16_to_cpu(ex->ee_len), ext_pblock(ex));
}
ext_debug("\n");
@@ -410,7 +410,7 @@ ext3_ext_binsearch(struct inode *inode,
}
path->p_ext = l - 1;
- ext_debug(" -> %d:%lld:%d ",
+ ext_debug(" -> %d:"E3FSBLK":%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
le16_to_cpu(path->p_ext->ee_len));
@@ -525,7 +525,7 @@ err:
*/
static int ext3_ext_insert_index(handle_t *handle, struct inode *inode,
struct ext3_ext_path *curp,
- int logical, int ptr)
+ int logical, ext3_fsblk_t ptr)
{
struct ext3_extent_idx *ix;
int len, err;
@@ -592,9 +592,9 @@ static int ext3_ext_split(handle_t *hand
struct ext3_extent_idx *fidx;
struct ext3_extent *ex;
int i = at, k, m, a;
- unsigned long newblock, oldblock;
+ ext3_fsblk_t newblock, oldblock;
__le32 border;
- int *ablocks = NULL; /* array of allocated blocks */
+ ext3_fsblk_t *ablocks = NULL; /* array of allocated blocks */
int err = 0;
/* make decision: where to split? */
@@ -627,10 +627,10 @@ static int ext3_ext_split(handle_t *hand
* we need this to handle errors and free blocks
* upon them
*/
- ablocks = kmalloc(sizeof(unsigned long) * depth, GFP_NOFS);
+ ablocks = kmalloc(sizeof(ext3_fsblk_t) * depth, GFP_NOFS);
if (!ablocks)
return -ENOMEM;
- memset(ablocks, 0, sizeof(unsigned long) * depth);
+ memset(ablocks, 0, sizeof(ext3_fsblk_t) * depth);
/* allocate all needed blocks */
ext_debug("allocate %d blocks for indexes/leaf\n", depth - at);
@@ -669,7 +669,7 @@ static int ext3_ext_split(handle_t *hand
path[depth].p_ext++;
while (path[depth].p_ext <=
EXT_MAX_EXTENT(path[depth].p_hdr)) {
- ext_debug("move %d:%lld:%d in new leaf %lu\n",
+ ext_debug("move %d:"E3FSBLK":%d in new leaf "E3FSBLK"\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
le16_to_cpu(path[depth].p_ext->ee_len),
@@ -715,7 +715,7 @@ static int ext3_ext_split(handle_t *hand
while (k--) {
oldblock = newblock;
newblock = ablocks[--a];
- bh = sb_getblk(inode->i_sb, newblock);
+ bh = sb_getblk(inode->i_sb, (ext3_fsblk_t)newblock);
if (!bh) {
err = -EIO;
goto cleanup;
@@ -734,7 +734,7 @@ static int ext3_ext_split(handle_t *hand
fidx->ei_block = border;
ext3_idx_store_pblock(fidx, oldblock);
- ext_debug("int.index at %d (block %lu): %lu -> %lu\n", i,
+ ext_debug("int.index at %d (block "E3FSBLK"): %lu -> "E3FSBLK"\n", i,
newblock, (unsigned long) le32_to_cpu(border),
oldblock);
/* copy indexes */
@@ -746,7 +746,7 @@ static int ext3_ext_split(handle_t *hand
BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
EXT_LAST_INDEX(path[i].p_hdr));
while (path[i].p_idx <= EXT_MAX_INDEX(path[i].p_hdr)) {
- ext_debug("%d: move %d:%d in new index %llu\n", i,
+ ext_debug("%d: move %d:"E3FSBLK" in new index "E3FSBLK"\n", i,
le32_to_cpu(path[i].p_idx->ei_block),
idx_pblock(path[i].p_idx),
newblock);
@@ -827,7 +827,7 @@ static int ext3_ext_grow_indepth(handle_
struct ext3_extent_header *neh;
struct ext3_extent_idx *fidx;
struct buffer_head *bh;
- unsigned long newblock;
+ ext3_fsblk_t newblock;
int err = 0;
newblock = ext3_ext_new_block(handle, inode, path, newext, &err);
@@ -879,7 +879,7 @@ static int ext3_ext_grow_indepth(handle_
neh = ext_inode_hdr(inode);
fidx = EXT_FIRST_INDEX(neh);
- ext_debug("new root: num %d(%d), lblock %d, ptr %llu\n",
+ ext_debug("new root: num %d(%d), lblock %d, ptr "E3FSBLK"\n",
le16_to_cpu(neh->eh_entries), le16_to_cpu(neh->eh_max),
le32_to_cpu(fidx->ei_block), idx_pblock(fidx));
@@ -1114,7 +1114,7 @@ int ext3_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext3_can_extents_be_merged(inode, ex, newext)) {
- ext_debug("append %d block to %d:%d (from %lld)\n",
+ ext_debug("append %d block to %d:%d (from "E3FSBLK")\n",
le16_to_cpu(newext->ee_len),
le32_to_cpu(ex->ee_block),
le16_to_cpu(ex->ee_len), ext_pblock(ex));
@@ -1173,7 +1173,7 @@ has_space:
if (!nearex) {
/* there is no extent in this leaf, create first one */
- ext_debug("first extent in the leaf: %d:%lld:%d\n",
+ ext_debug("first extent in the leaf: %d:"E3FSBLK":%d\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
le16_to_cpu(newext->ee_len));
@@ -1185,7 +1185,7 @@ has_space:
len = EXT_MAX_EXTENT(eh) - nearex;
len = (len - 1) * sizeof(struct ext3_extent);
len = len < 0 ? 0 : len;
- ext_debug("insert %d:%lld:%d after: nearest 0x%p, "
+ ext_debug("insert %d:"E3FSBLK":%d after: nearest 0x%p, "
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
@@ -1198,7 +1198,7 @@ has_space:
BUG_ON(newext->ee_block == nearex->ee_block);
len = (EXT_MAX_EXTENT(eh) - nearex) * sizeof(struct ext3_extent);
len = len < 0 ? 0 : len;
- ext_debug("insert %d:%lld:%d before: nearest 0x%p, "
+ ext_debug("insert %d:"E3FSBLK":%d before: nearest 0x%p, "
"move %d from 0x%p to 0x%p\n",
le32_to_cpu(newext->ee_block),
ext_pblock(newext),
@@ -1432,11 +1432,11 @@ ext3_ext_in_cache(struct inode *inode, u
ex->ee_block = cpu_to_le32(cex->ec_block);
ext3_ext_store_pblock(ex, cex->ec_start);
ex->ee_len = cpu_to_le16(cex->ec_len);
- ext_debug("%lu cached by %lu:%lu:%lu\n",
+ ext_debug("%lu cached by %lu:%lu:"E3FSBLK"\n",
(unsigned long) block,
(unsigned long) cex->ec_block,
(unsigned long) cex->ec_len,
- (unsigned long) cex->ec_start);
+ cex->ec_start);
return cex->ec_type;
}
@@ -1454,7 +1454,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
{
struct buffer_head *bh;
int err;
- unsigned long leaf;
+ ext3_fsblk_t leaf;
/* free index block */
path--;
@@ -1465,7 +1465,7 @@ int ext3_ext_rm_idx(handle_t *handle, st
path->p_hdr->eh_entries = cpu_to_le16(le16_to_cpu(path->p_hdr->eh_entries)-1);
if ((err = ext3_ext_dirty(handle, inode, path)))
return err;
- ext_debug("index is empty, remove it, free block %lu\n", leaf);
+ ext_debug("index is empty, remove it, free block "E3FSBLK"\n", leaf);
bh = sb_find_get_block(inode->i_sb, leaf);
ext3_forget(handle, 1, inode, bh, leaf);
ext3_free_blocks(handle, inode, leaf, 1);
@@ -1547,10 +1547,11 @@ static int ext3_remove_blocks(handle_t *
if (from >= le32_to_cpu(ex->ee_block)
&& to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
/* tail removal */
- unsigned long num, start;
+ unsigned long num;
+ ext3_fsblk_t start;
num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
- ext_debug("free last %lu blocks starting %lu\n", num, start);
+ ext_debug("free last %lu blocks starting "E3FSBLK"\n", num, start);
for (i = 0; i < num; i++) {
bh = sb_find_get_block(inode->i_sb, start + i);
ext3_forget(handle, 0, inode, bh, start + i);
@@ -1664,7 +1665,7 @@ ext3_ext_rm_leaf(handle_t *handle, struc
if (err)
goto out;
- ext_debug("new extent: %u:%u:%llu\n", block, num,
+ ext_debug("new extent: %u:%u:"E3FSBLK"\n", block, num,
ext_pblock(ex));
ex--;
ex_ee_block = le32_to_cpu(ex->ee_block);
@@ -1780,7 +1781,7 @@ int ext3_ext_remove_space(struct inode *
path[i].p_idx);
if (ext3_ext_more_to_rm(path + i)) {
/* go to the next level */
- ext_debug("move to level %d (block %llu)\n",
+ ext_debug("move to level %d (block "E3FSBLK")\n",
i + 1, idx_pblock(path[i].p_idx));
memset(path + i + 1, 0, sizeof(*path));
path[i+1].p_bh =
@@ -1883,13 +1884,14 @@ void ext3_ext_release(struct super_block
#endif
}
-int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, sector_t iblock,
+int ext3_ext_get_blocks(handle_t *handle, struct inode *inode, ext3_fsblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
int create, int extend_disksize)
{
struct ext3_ext_path *path = NULL;
struct ext3_extent newex, *ex;
- int goal, newblock, err = 0, depth;
+ ext3_fsblk_t goal, newblock;
+ int err = 0, depth;
unsigned long allocated = 0;
__clear_bit(BH_New, &bh_result->b_state);
@@ -1939,14 +1941,14 @@ int ext3_ext_get_blocks(handle_t *handle
if ((ex = path[depth].p_ext)) {
unsigned long ee_block = le32_to_cpu(ex->ee_block);
- unsigned long ee_start = ext_pblock(ex);
+ ext3_fsblk_t ee_start = ext_pblock(ex);
unsigned short ee_len = le16_to_cpu(ex->ee_len);
/* if found exent covers block, simple return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
newblock = iblock - ee_block + ee_start;
/* number of remain blocks in the extent */
allocated = ee_len - (iblock - ee_block);
- ext_debug("%d fit into %lu:%d -> %d\n", (int) iblock,
+ ext_debug("%d fit into %lu:%d -> "E3FSBLK"\n", (int) iblock,
ee_block, ee_len, newblock);
ext3_ext_put_in_cache(inode, ee_block, ee_len,
ee_start, EXT3_EXT_CACHE_EXTENT);
@@ -1970,7 +1972,7 @@ int ext3_ext_get_blocks(handle_t *handle
newblock = ext3_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
- ext_debug("allocate new block: goal %d, found %d/%lu\n",
+ ext_debug("allocate new block: goal "E3FSBLK", found "E3FSBLK"/%lu\n",
goal, newblock, allocated);
/* try to insert new extent into found leaf and return */
diff -puN include/linux/ext3_fs_extents.h~ext3-extents-ext3_fsblk_t include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-extents-ext3_fsblk_t 2006-06-28 16:46:45.592224565 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h 2006-06-28 16:46:45.604223188 -0700
@@ -108,7 +108,7 @@ struct ext3_extent_header {
* truncate uses it to simulate recursive walking
*/
struct ext3_ext_path {
- __u64 p_block;
+ ext3_fsblk_t p_block;
__u16 p_depth;
struct ext3_extent *p_ext;
struct ext3_extent_idx *p_idx;
diff -puN include/linux/ext3_fs_i.h~ext3-extents-ext3_fsblk_t include/linux/ext3_fs_i.h
--- linux-2.6.17/include/linux/ext3_fs_i.h~ext3-extents-ext3_fsblk_t 2006-06-28 16:46:45.596224106 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_i.h 2006-06-28 16:46:45.604223188 -0700
@@ -68,10 +68,10 @@ struct ext3_block_alloc_info {
* storage for cached extent
*/
struct ext3_ext_cache {
- sector_t ec_start;
- __u32 ec_block;
- __u32 ec_len; /* must be 32bit to return holes */
- __u32 ec_type;
+ ext3_fsblk_t ec_start;
+ __u32 ec_block;
+ __u32 ec_len; /* must be 32bit to return holes */
+ __u32 ec_type;
};
/*
_
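
The E3FSBLK conversions above rely on C string-literal concatenation: the format fragment for the block type lives in one macro, so every debug message follows the type when its width changes. Here is a small userspace sketch of the same idea. demo_fsblk_t, DEMO_FSBLK and demo_format_extent are hypothetical names, and PRIu64 plays the role that SECTOR_FMT plays in the kernel.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical block-number type and its matching printf format fragment. */
typedef uint64_t demo_fsblk_t;
#define DEMO_FSBLK "%" PRIu64

/* Format "logical:physical:length" into buf; returns snprintf()'s result. */
static int demo_format_extent(char *buf, size_t len, unsigned int lblock,
			      demo_fsblk_t pblock, unsigned int count)
{
	assert(buf != NULL && len > 0);
	/* Adjacent literals are joined at compile time into one format. */
	return snprintf(buf, len, "%u:" DEMO_FSBLK ":%u",
			lblock, pblock, count);
}
```

If demo_fsblk_t were later narrowed or widened, only the typedef and DEMO_FSBLK would need to change, which is the property the patch is after.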
* [RFC][Update][Patch 6/16]handing unitialized extents
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (8 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 5/16]block type convert " Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:17 ` [RFC][Update][Patch 7/16]Core 64 bit JBD changes Mingming Cao
` (9 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Make it possible to add file preallocation support in future as an
RO_COMPAT feature by recognizing uninitialized extents as holes and
limiting extent length to keep the top bit of ee_len free for marking
uninitialized extents.
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
linux-2.6.17-ming/fs/ext3/extents.c | 16 ++++++++++++++++
linux-2.6.17-ming/include/linux/ext3_fs_extents.h | 2 ++
2 files changed, 18 insertions(+)
diff -puN fs/ext3/extents.c~ext3-unitialized-extent-handling fs/ext3/extents.c
--- linux-2.6.17/fs/ext3/extents.c~ext3-unitialized-extent-handling 2006-06-28 16:46:49.657758078 -0700
+++ linux-2.6.17-ming/fs/ext3/extents.c 2006-06-28 16:46:49.667756930 -0700
@@ -1082,6 +1082,13 @@ ext3_can_extents_be_merged(struct inode
!= le32_to_cpu(ex2->ee_block))
return 0;
+ /*
+ * To allow future support for preallocated extents to be added
+ * as an RO_COMPAT feature, refuse to merge two extents if that
+ * would result in the top bit of ee_len being set
+ */
+ if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+ return 0;
#ifdef AGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
@@ -1943,6 +1950,15 @@ int ext3_ext_get_blocks(handle_t *handle
unsigned long ee_block = le32_to_cpu(ex->ee_block);
ext3_fsblk_t ee_start = ext_pblock(ex);
unsigned short ee_len = le16_to_cpu(ex->ee_len);
+
+ /*
+ * Allow future support for preallocated extents to be added
+ * as an RO_COMPAT feature:
+ * Uninitialized extents are treated as holes, except that
+ * we avoid (fail) allocating new blocks during a write.
+ */
+ if (ee_len > EXT_MAX_LEN)
+ goto out2;
/* if found exent covers block, simple return it */
if (iblock >= ee_block && iblock < ee_block + ee_len) {
newblock = iblock - ee_block + ee_start;
diff -puN include/linux/ext3_fs_extents.h~ext3-unitialized-extent-handling include/linux/ext3_fs_extents.h
--- linux-2.6.17/include/linux/ext3_fs_extents.h~ext3-unitialized-extent-handling 2006-06-28 16:46:49.661757619 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs_extents.h 2006-06-28 16:46:49.668756816 -0700
@@ -141,6 +141,8 @@ typedef int (*ext_prepare_callback)(stru
#define EXT_MAX_BLOCK 0xffffffff
+#define EXT_MAX_LEN ((1UL << 15) - 1)
+
#define EXT_FIRST_EXTENT(__hdr__) \
((struct ext3_extent *) (((char *) (__hdr__)) + \
_
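The length rule in this patch can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `struct demo_extent`, `demo_can_merge()` and `demo_is_uninitialized()` are hypothetical names, lengths are kept in host byte order instead of little-endian, and physical-block contiguity is ignored. Only the value of `EXT_MAX_LEN` is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as in the patch: the largest length that leaves bit 15 clear. */
#define EXT_MAX_LEN ((1UL << 15) - 1)

/* Hypothetical, simplified extent: host byte order, no physical block. */
struct demo_extent {
	uint32_t ee_block;	/* first logical block covered */
	uint16_t ee_len;	/* number of blocks covered */
};

/* Returns 1 if ex2 directly follows ex1 and the merged length stays
 * within EXT_MAX_LEN, 0 otherwise. */
int demo_can_merge(const struct demo_extent *ex1,
		   const struct demo_extent *ex2)
{
	assert(ex1 != NULL && ex2 != NULL);
	if (ex1->ee_block + ex1->ee_len != ex2->ee_block)
		return 0;
	/* Refuse a merge that would need the reserved top bit of ee_len. */
	if ((unsigned long)ex1->ee_len + ex2->ee_len > EXT_MAX_LEN)
		return 0;
	return 1;
}

/* A length above EXT_MAX_LEN has the top bit set; the patch treats such
 * an extent as a hole on lookup. */
int demo_is_uninitialized(const struct demo_extent *ex)
{
	assert(ex != NULL);
	return ex->ee_len > EXT_MAX_LEN;
}
```

Two adjacent extents of 20000 and 12767 blocks may still merge under this rule (the sum is exactly `EXT_MAX_LEN`), while 20000 and 12768 may not.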
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 7/16]Core 64 bit JBD changes
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (9 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 6/16]handing unitialized extents Mingming Cao
@ 2006-06-30 0:17 ` Mingming Cao
2006-06-30 0:18 ` [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags Mingming Cao
` (8 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:17 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Here is the patch to JBD to handle 64 bit block numbers, originally
from Zach Brown. This patch is useful only after adding support for
64-bit block numbers in the filesystem.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
---
linux-2.6.17-ming/fs/jbd/commit.c | 16 +++++++++---
linux-2.6.17-ming/fs/jbd/journal.c | 11 ++++++++
linux-2.6.17-ming/fs/jbd/recovery.c | 42 +++++++++++++++++++++++-----------
linux-2.6.17-ming/fs/jbd/revoke.c | 14 ++++++++---
linux-2.6.17-ming/include/linux/jbd.h | 11 +++++++-
5 files changed, 72 insertions(+), 22 deletions(-)
diff -puN fs/jbd/commit.c~64bit_jbd_core fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~64bit_jbd_core 2006-06-28 16:46:53.936267153 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c 2006-06-28 16:46:53.953265203 -0700
@@ -160,6 +160,12 @@ static int journal_write_commit_record(j
return (ret == -EIO);
}
+static inline void write_split_be64(__be32 *high, __be32 *low, u64 val)
+{
+ *low = cpu_to_be32(val & (u32)~0);
+ *high = cpu_to_be32(val >> 32);
+}
+
/*
* journal_commit_transaction
*
@@ -182,6 +188,7 @@ void journal_commit_transaction(journal_
int first_tag = 0;
int tag_flag;
int i;
+ int tag_bytes = journal_tag_bytes(journal);
/*
* First job: lock down the current transaction and wait for
@@ -553,10 +560,11 @@ write_out_data:
tag_flag |= JFS_FLAG_SAME_UUID;
tag = (journal_block_tag_t *) tagp;
- tag->t_blocknr = cpu_to_be32(jh2bh(jh)->b_blocknr);
+ write_split_be64(&tag->t_blocknr_high, &tag->t_blocknr,
+ jh2bh(jh)->b_blocknr);
tag->t_flags = cpu_to_be32(tag_flag);
- tagp += sizeof(journal_block_tag_t);
- space_left -= sizeof(journal_block_tag_t);
+ tagp += tag_bytes;
+ space_left -= tag_bytes;
if (first_tag) {
memcpy (tagp, journal->j_uuid, 16);
@@ -570,7 +578,7 @@ write_out_data:
if (bufs == journal->j_wbufsize ||
commit_transaction->t_buffers == NULL ||
- space_left < sizeof(journal_block_tag_t) + 16) {
+ space_left < tag_bytes + 16) {
jbd_debug(4, "JBD: Submit %d IOs\n", bufs);
diff -puN fs/jbd/journal.c~64bit_jbd_core fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~64bit_jbd_core 2006-06-28 16:46:53.939266809 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c 2006-06-28 16:46:53.956264859 -0700
@@ -1603,6 +1603,17 @@ int journal_blocks_per_page(struct inode
}
/*
+ * helper functions to deal with 32 or 64bit block numbers.
+ */
+size_t journal_tag_bytes(journal_t *journal)
+{
+ if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
+ return sizeof(journal_block_tag_t);
+ else
+ return offsetof(journal_block_tag_t, t_blocknr_high);
+}
+
+/*
* Simple support for retrying memory allocations. Introduced to help to
* debug different VM deadlock avoidance strategies.
*/
diff -puN fs/jbd/recovery.c~64bit_jbd_core fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~64bit_jbd_core 2006-06-28 16:46:53.942266465 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c 2006-06-28 16:46:53.957264744 -0700
@@ -178,19 +178,20 @@ static int jread(struct buffer_head **bh
* Count the number of in-use tags in a journal descriptor block.
*/
-static int count_tags(struct buffer_head *bh, int size)
+static int count_tags(journal_t *journal, struct buffer_head *bh)
{
char * tagp;
journal_block_tag_t * tag;
- int nr = 0;
+ int nr = 0, size = journal->j_blocksize;
+ int tag_bytes = journal_tag_bytes(journal);
tagp = &bh->b_data[sizeof(journal_header_t)];
- while ((tagp - bh->b_data + sizeof(journal_block_tag_t)) <= size) {
+ while ((tagp - bh->b_data + tag_bytes) <= size) {
tag = (journal_block_tag_t *) tagp;
nr++;
- tagp += sizeof(journal_block_tag_t);
+ tagp += tag_bytes;
if (!(tag->t_flags & cpu_to_be32(JFS_FLAG_SAME_UUID)))
tagp += 16;
@@ -307,6 +308,13 @@ int journal_skip_recovery(journal_t *jou
return err;
}
+static inline u64 read_split_be64(__be32 *high, __be32 *low)
+{
+ u64 ret = be32_to_cpu(*low);
+ ret |= (u64)be32_to_cpu(*high) << 32;
+ return ret;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -318,11 +326,12 @@ static int do_one_pass(journal_t *journa
struct buffer_head * bh;
unsigned int sequence;
int blocktype;
+ int tag_bytes = journal_tag_bytes(journal);
/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
MAX_BLOCKS_PER_DESC = ((journal->j_blocksize-sizeof(journal_header_t))
- / sizeof(journal_block_tag_t));
+ / tag_bytes);
/*
* First thing is to establish what we expect to find in the log
@@ -412,8 +421,7 @@ static int do_one_pass(journal_t *journa
* in pass REPLAY; otherwise, just skip over the
* blocks it describes. */
if (pass != PASS_REPLAY) {
- next_log_block +=
- count_tags(bh, journal->j_blocksize);
+ next_log_block += count_tags(journal, bh);
wrap(journal, next_log_block);
brelse(bh);
continue;
@@ -424,7 +432,7 @@ static int do_one_pass(journal_t *journa
* getting done here! */
tagp = &bh->b_data[sizeof(journal_header_t)];
- while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
+ while ((tagp - bh->b_data + tag_bytes)
<= journal->j_blocksize) {
unsigned long io_block;
@@ -446,7 +454,8 @@ static int do_one_pass(journal_t *journa
unsigned long blocknr;
J_ASSERT(obh != NULL);
- blocknr = be32_to_cpu(tag->t_blocknr);
+ blocknr = read_split_be64(&tag->t_blocknr_high,
+ &tag->t_blocknr);
/* If the block has been
* revoked, then we're all done
@@ -494,7 +503,7 @@ static int do_one_pass(journal_t *journa
}
skip_write:
- tagp += sizeof(journal_block_tag_t);
+ tagp += tag_bytes;
if (!(flags & JFS_FLAG_SAME_UUID))
tagp += 16;
@@ -572,17 +581,24 @@ static int scan_revoke_records(journal_t
{
journal_revoke_header_t *header;
int offset, max;
+ int record_len = 4;
header = (journal_revoke_header_t *) bh->b_data;
offset = sizeof(journal_revoke_header_t);
max = be32_to_cpu(header->r_count);
- while (offset < max) {
+ if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
+ record_len = 8;
+
+ while (offset + record_len < max) {
unsigned long blocknr;
int err;
- blocknr = be32_to_cpu(* ((__be32 *) (bh->b_data+offset)));
- offset += 4;
+ if (record_len == 4)
+ blocknr = be32_to_cpu(* ((__be32 *) (bh->b_data+offset)));
+ else
+ blocknr = be64_to_cpu(* ((__be64 *) (bh->b_data+offset)));
+ offset += record_len;
err = journal_set_revoke(journal, blocknr, sequence);
if (err)
return err;
diff -puN fs/jbd/revoke.c~64bit_jbd_core fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~64bit_jbd_core 2006-06-28 16:46:53.945266121 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c 2006-06-28 16:46:53.959264514 -0700
@@ -584,9 +584,17 @@ static void write_one_revoke_record(jour
*descriptorp = descriptor;
}
- * ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) =
- cpu_to_be32(record->blocknr);
- offset += 4;
+ if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT)) {
+ * ((__be64 *)(&jh2bh(descriptor)->b_data[offset])) =
+ cpu_to_be64(record->blocknr);
+ offset += 8;
+
+ } else {
+ * ((__be32 *)(&jh2bh(descriptor)->b_data[offset])) =
+ cpu_to_be32(record->blocknr);
+ offset += 4;
+ }
+
*offsetp = offset;
}
diff -puN include/linux/jbd.h~64bit_jbd_core include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~64bit_jbd_core 2006-06-28 16:46:53.949265662 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h 2006-06-28 16:46:53.961264285 -0700
@@ -147,12 +147,16 @@ typedef struct journal_header_s
/*
- * The block tag: used to describe a single buffer in the journal
+ * The block tag: used to describe a single buffer in the journal.
+ * t_blocknr_high is only used if INCOMPAT_64BIT is set, so this
+ * raw struct shouldn't be used for pointer math or sizeof() - use
+ * journal_tag_bytes(journal) instead to compute this.
*/
typedef struct journal_block_tag_s
{
__be32 t_blocknr; /* The on-disk block number */
__be32 t_flags; /* See below */
+ __be32 t_blocknr_high; /* most-significant high 32bits. */
} journal_block_tag_t;
/*
@@ -232,11 +236,13 @@ typedef struct journal_superblock_s
((j)->j_superblock->s_feature_incompat & cpu_to_be32((mask))))
#define JFS_FEATURE_INCOMPAT_REVOKE 0x00000001
+#define JFS_FEATURE_INCOMPAT_64BIT 0x00000002
/* Features known to this kernel version: */
#define JFS_KNOWN_COMPAT_FEATURES 0
#define JFS_KNOWN_ROCOMPAT_FEATURES 0
-#define JFS_KNOWN_INCOMPAT_FEATURES JFS_FEATURE_INCOMPAT_REVOKE
+#define JFS_KNOWN_INCOMPAT_FEATURES (JFS_FEATURE_INCOMPAT_REVOKE | \
+ JFS_FEATURE_INCOMPAT_64BIT)
#ifdef __KERNEL__
@@ -1050,6 +1056,7 @@ static inline int tid_geq(tid_t x, tid_t
}
extern int journal_blocks_per_page(struct inode *inode);
+extern size_t journal_tag_bytes(journal_t *journal);
/*
* Return the minimum number of blocks which must be free in the journal
_
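As a rough userspace sketch of the variable tag size this patch introduces: the code below walks a descriptor block by the journal's tag size instead of `sizeof()` of the raw struct. All `demo_*` names are hypothetical, the flag values are assumed for illustration, and the header size is passed in as a parameter instead of being taken from `journal_header_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On-disk tag layout from the patch; fields are big-endian on disk. */
struct demo_tag {
	uint32_t t_blocknr;	 /* low 32 bits of the block number */
	uint32_t t_flags;
	uint32_t t_blocknr_high; /* only present on a 64-bit journal */
};

/* Flag values assumed for this sketch. */
#define DEMO_FLAG_SAME_UUID 2u
#define DEMO_FLAG_LAST_TAG  8u

/* Bytes occupied by one tag: 12 with the 64-bit feature, 8 without. */
size_t demo_tag_bytes(int journal_is_64bit)
{
	return journal_is_64bit ? sizeof(struct demo_tag)
				: offsetof(struct demo_tag, t_blocknr_high);
}

/* Store a 32-bit value in big-endian byte order. */
void demo_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Load a 32-bit big-endian value. */
uint32_t demo_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Count the tags in a descriptor block, stepping by tag_bytes. */
int demo_count_tags(const unsigned char *data, size_t header_bytes,
		    size_t block_size, size_t tag_bytes)
{
	size_t off = header_bytes;
	int nr = 0;

	assert(tag_bytes == 8 || tag_bytes == 12);
	while (off + tag_bytes <= block_size) {
		uint32_t flags = demo_get_be32(data + off + 4);

		nr++;
		off += tag_bytes;
		if (!(flags & DEMO_FLAG_SAME_UUID))
			off += 16;	/* a 16-byte UUID follows this tag */
		if (flags & DEMO_FLAG_LAST_TAG)
			break;
	}
	return nr;
}
```

The point of the design is that the same buffer is interpreted differently depending on the feature flag, so every pointer step has to use the per-journal size.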
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (10 preceding siblings ...)
2006-06-30 0:17 ` [RFC][Update][Patch 7/16]Core 64 bit JBD changes Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 0:18 ` [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors Mingming Cao
` (7 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
When writing block numbers into a journal descriptor block, don't write
the top 32 bits of a tag unless we're using a 64-bit journal. That
avoids any possibility of overflowing off the end of the descriptor
block in the case where the last 32-bit tag only just fits into the
descriptor block.
Also cleans up the tag handling slightly by introducing new macros for
the size of 32- and 64-bit descriptor tags.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
---
linux-2.6.17-ming/fs/jbd/commit.c | 11 ++++++-----
linux-2.6.17-ming/include/linux/jbd.h | 3 +++
2 files changed, 9 insertions(+), 5 deletions(-)
diff -puN fs/jbd/commit.c~jbd-avoid-blk-overflow-write-journal-metadata-tag fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~jbd-avoid-blk-overflow-write-journal-metadata-tag 2006-06-28 16:46:58.783710948 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c 2006-06-28 16:46:58.791710030 -0700
@@ -160,10 +160,12 @@ static int journal_write_commit_record(j
return (ret == -EIO);
}
-static inline void write_split_be64(__be32 *high, __be32 *low, u64 val)
+static inline void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
+ sector_t block)
{
- *low = cpu_to_be32(val & (u32)~0);
- *high = cpu_to_be32(val >> 32);
+ tag->t_blocknr = cpu_to_be32(block & (u32)~0);
+ if (tag_bytes > JBD_TAG_SIZE32)
+ tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
}
/*
@@ -560,8 +562,7 @@ write_out_data:
tag_flag |= JFS_FLAG_SAME_UUID;
tag = (journal_block_tag_t *) tagp;
- write_split_be64(&tag->t_blocknr_high, &tag->t_blocknr,
- jh2bh(jh)->b_blocknr);
+ write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
tag->t_flags = cpu_to_be32(tag_flag);
tagp += tag_bytes;
space_left -= tag_bytes;
diff -puN include/linux/jbd.h~jbd-avoid-blk-overflow-write-journal-metadata-tag include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~jbd-avoid-blk-overflow-write-journal-metadata-tag 2006-06-28 16:46:58.787710489 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h 2006-06-28 16:46:58.793709801 -0700
@@ -159,6 +159,9 @@ typedef struct journal_block_tag_s
__be32 t_blocknr_high; /* most-significant high 32bits. */
} journal_block_tag_t;
+#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+
/*
* The revoke descriptor: used on disk to describe a series of blocks to
* be revoked from the log
_
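The overflow this patch guards against can be illustrated with a small userspace sketch. It is not the kernel code: the tag is treated as a raw byte buffer, `demo_store_be32()` and `demo_write_tag_block()` are hypothetical helpers, and the two size constants simply mirror the 8-byte and 12-byte layouts implied by `JBD_TAG_SIZE32` and `JBD_TAG_SIZE64`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAG_SIZE32 8u	/* t_blocknr + t_flags */
#define DEMO_TAG_SIZE64 12u	/* ... plus t_blocknr_high */

/* Store a 32-bit value in big-endian byte order. */
void demo_store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Write the block number of the tag that starts at 'tag'. The high word
 * (byte offset 8) is written only for 64-bit tags, so an 8-byte tag
 * never touches the bytes that follow it. */
void demo_write_tag_block(size_t tag_bytes, unsigned char *tag,
			  uint64_t block)
{
	assert(tag_bytes == DEMO_TAG_SIZE32 || tag_bytes == DEMO_TAG_SIZE64);
	demo_store_be32(tag, (uint32_t)(block & 0xffffffffu));
	if (tag_bytes > DEMO_TAG_SIZE32)
		demo_store_be32(tag + 8, (uint32_t)(block >> 32));
}
```

If the last 32-bit tag ends exactly at the end of the descriptor block, the bytes at offset 8 and beyond belong to something else, which is why the write has to be conditional.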
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (11 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 8/16]Avoid potential block overflow when writing journal metadata tags Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 0:18 ` [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes() Mingming Cao
` (6 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
We must never attempt to read the high 32 bits of a descriptor tag on
a 32-bit journal, even when CONFIG_LBD is set, as we'll end up reading
garbage from the subsequent tag.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
---
linux-2.6.17-ming/fs/jbd/recovery.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)
diff -puN fs/jbd/recovery.c~jbd-read-32bit-tag-fix fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~jbd-read-32bit-tag-fix 2006-06-28 16:47:02.555278191 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c 2006-06-28 16:47:02.558277847 -0700
@@ -308,11 +308,12 @@ int journal_skip_recovery(journal_t *jou
return err;
}
-static inline u64 read_split_be64(__be32 *high, __be32 *low)
+static inline sector_t read_tag_block(int tag_bytes, journal_block_tag_t *tag)
{
- u64 ret = be32_to_cpu(*low);
- ret |= (u64)be32_to_cpu(*high) << 32;
- return ret;
+ sector_t block = be32_to_cpu(tag->t_blocknr);
+ if (tag_bytes > JBD_TAG_SIZE32)
+ block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
+ return block;
}
static int do_one_pass(journal_t *journal,
@@ -454,8 +455,8 @@ static int do_one_pass(journal_t *journa
unsigned long blocknr;
J_ASSERT(obh != NULL);
- blocknr = read_split_be64(&tag->t_blocknr_high,
- &tag->t_blocknr);
+ blocknr = read_tag_block(tag_bytes,
+ tag);
/* If the block has been
* revoked, then we're all done
_
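To make the "garbage from the subsequent tag" concrete, here is a minimal userspace sketch of the read side. The helper names `demo_load_be32()` and `demo_read_tag_block()` are hypothetical, the tags are modelled as raw bytes, and the size constants only mirror the 8-byte and 12-byte tag layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAG_SIZE32 8u	/* t_blocknr + t_flags */
#define DEMO_TAG_SIZE64 12u	/* ... plus t_blocknr_high */

/* Load a 32-bit big-endian value. */
uint32_t demo_load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read the block number of the tag that starts at 'tag'. The word at
 * byte offset 8 is consulted only for 64-bit tags; with 8-byte tags
 * that word already belongs to the next tag. */
uint64_t demo_read_tag_block(size_t tag_bytes, const unsigned char *tag)
{
	uint64_t block;

	assert(tag_bytes == DEMO_TAG_SIZE32 || tag_bytes == DEMO_TAG_SIZE64);
	block = demo_load_be32(tag);
	if (tag_bytes > DEMO_TAG_SIZE32)
		block |= (uint64_t)demo_load_be32(tag + 8) << 32;
	return block;
}
```

With two packed 8-byte tags for blocks 5 and 7, reading the first one as if it were a 64-bit tag would pull block 7 into the high word, which is the misread this patch avoids.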
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes()
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (12 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 9/16]Fix reading of 32-bit tag descriptors Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 0:18 ` [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes Mingming Cao
` (5 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Cleanup journal_tag_bytes() to use the new JBD_TAG_SIZE* macros.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
---
linux-2.6.17-ming/fs/jbd/journal.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff -puN fs/jbd/journal.c~jbd-cleanup-journal_tag_bytes fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~jbd-cleanup-journal_tag_bytes 2006-06-28 16:47:05.112984715 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c 2006-06-28 16:47:05.117984141 -0700
@@ -1608,9 +1608,9 @@ int journal_blocks_per_page(struct inode
size_t journal_tag_bytes(journal_t *journal)
{
if (JFS_HAS_INCOMPAT_FEATURE(journal, JFS_FEATURE_INCOMPAT_64BIT))
- return sizeof(journal_block_tag_t);
+ return JBD_TAG_SIZE64;
else
- return offsetof(journal_block_tag_t, t_blocknr_high);
+ return JBD_TAG_SIZE32;
}
/*
_
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (13 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 10/16]Cleanup journal_tag_bytes() Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
` (4 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Fix the types of the in-kernel block variables in the JBD layer so that
they can hold block numbers wider than 32 bits, by converting them to sector_t.
Signed-Off-By: Mingming Cao <cmm@us.ibm.com>
---
linux-2.6.17-ming/fs/jbd/commit.c | 2 +-
linux-2.6.17-ming/fs/jbd/journal.c | 16 ++++++++--------
linux-2.6.17-ming/fs/jbd/recovery.c | 8 ++++----
linux-2.6.17-ming/fs/jbd/revoke.c | 24 +++++++++++++-----------
linux-2.6.17-ming/include/linux/ext3_jbd.h | 2 +-
linux-2.6.17-ming/include/linux/jbd.h | 17 ++++++++---------
6 files changed, 35 insertions(+), 34 deletions(-)
diff -puN fs/jbd/commit.c~sector_t-jbd fs/jbd/commit.c
--- linux-2.6.17/fs/jbd/commit.c~sector_t-jbd 2006-06-28 16:47:07.568702941 -0700
+++ linux-2.6.17-ming/fs/jbd/commit.c 2006-06-28 16:47:07.590700417 -0700
@@ -182,7 +182,7 @@ void journal_commit_transaction(journal_
int bufs;
int flags;
int err;
- unsigned long blocknr;
+ sector_t blocknr;
char *tagp = NULL;
journal_header_t *header;
journal_block_tag_t *tag = NULL;
diff -puN fs/jbd/journal.c~sector_t-jbd fs/jbd/journal.c
--- linux-2.6.17/fs/jbd/journal.c~sector_t-jbd 2006-06-28 16:47:07.572702482 -0700
+++ linux-2.6.17-ming/fs/jbd/journal.c 2006-06-28 16:47:07.593700073 -0700
@@ -270,7 +270,7 @@ static void journal_kill_thread(journal_
int journal_write_metadata_buffer(transaction_t *transaction,
struct journal_head *jh_in,
struct journal_head **jh_out,
- int blocknr)
+ sector_t blocknr)
{
int need_copy_out = 0;
int done_copy_out = 0;
@@ -554,7 +554,7 @@ int log_wait_commit(journal_t *journal,
* Log buffer allocation routines:
*/
-int journal_next_log_block(journal_t *journal, unsigned long *retp)
+int journal_next_log_block(journal_t *journal, sector_t *retp)
{
unsigned long blocknr;
@@ -578,10 +578,10 @@ int journal_next_log_block(journal_t *jo
* ready.
*/
int journal_bmap(journal_t *journal, unsigned long blocknr,
- unsigned long *retp)
+ sector_t *retp)
{
int err = 0;
- unsigned long ret;
+ sector_t ret;
if (journal->j_inode) {
ret = bmap(journal->j_inode, blocknr);
@@ -617,7 +617,7 @@ int journal_bmap(journal_t *journal, uns
struct journal_head *journal_get_descriptor_buffer(journal_t *journal)
{
struct buffer_head *bh;
- unsigned long blocknr;
+ sector_t blocknr;
int err;
err = journal_next_log_block(journal, &blocknr);
@@ -705,7 +705,7 @@ fail:
*/
journal_t * journal_init_dev(struct block_device *bdev,
struct block_device *fs_dev,
- int start, int len, int blocksize)
+ sector_t start, int len, int blocksize)
{
journal_t *journal = journal_init_common();
struct buffer_head *bh;
@@ -753,7 +753,7 @@ journal_t * journal_init_inode (struct i
journal_t *journal = journal_init_common();
int err;
int n;
- unsigned long blocknr;
+ sector_t blocknr;
if (!journal)
return NULL;
@@ -853,7 +853,7 @@ static int journal_reset(journal_t *jour
**/
int journal_create(journal_t *journal)
{
- unsigned long blocknr;
+ sector_t blocknr;
struct buffer_head *bh;
journal_superblock_t *sb;
int i, err;
diff -puN fs/jbd/recovery.c~sector_t-jbd fs/jbd/recovery.c
--- linux-2.6.17/fs/jbd/recovery.c~sector_t-jbd 2006-06-28 16:47:07.575702138 -0700
+++ linux-2.6.17-ming/fs/jbd/recovery.c 2006-06-28 16:47:07.595699844 -0700
@@ -70,7 +70,7 @@ static int do_readahead(journal_t *journ
{
int err;
unsigned int max, nbufs, next;
- unsigned long blocknr;
+ sector_t blocknr;
struct buffer_head *bh;
struct buffer_head * bufs[MAXBUF];
@@ -132,7 +132,7 @@ static int jread(struct buffer_head **bh
unsigned int offset)
{
int err;
- unsigned long blocknr;
+ sector_t blocknr;
struct buffer_head *bh;
*bhp = NULL;
@@ -452,7 +452,7 @@ static int do_one_pass(journal_t *journa
"block %ld in log\n",
err, io_block);
} else {
- unsigned long blocknr;
+ sector_t blocknr;
J_ASSERT(obh != NULL);
blocknr = read_tag_block(tag_bytes,
@@ -592,7 +592,7 @@ static int scan_revoke_records(journal_t
record_len = 8;
while (offset + record_len < max) {
- unsigned long blocknr;
+ sector_t blocknr;
int err;
if (record_len == 4)
diff -puN fs/jbd/revoke.c~sector_t-jbd fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~sector_t-jbd 2006-06-28 16:47:07.578701794 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c 2006-06-28 16:47:07.596699729 -0700
@@ -81,7 +81,7 @@ struct jbd_revoke_record_s
{
struct list_head hash;
tid_t sequence; /* Used for recovery only */
- unsigned long blocknr;
+ sector_t blocknr;
};
@@ -106,17 +106,18 @@ static void flush_descriptor(journal_t *
/* Utility functions to maintain the revoke table */
/* Borrowed from buffer.c: this is a tried and tested block hash function */
-static inline int hash(journal_t *journal, unsigned long block)
+static inline int hash(journal_t *journal, sector_t block)
{
struct jbd_revoke_table_s *table = journal->j_revoke;
int hash_shift = table->hash_shift;
+ int hash = (int)block ^ (int)(block >> 32);
- return ((block << (hash_shift - 6)) ^
- (block >> 13) ^
- (block << (hash_shift - 12))) & (table->hash_size - 1);
+ return ((hash << (hash_shift - 6)) ^
+ (hash >> 13) ^
+ (hash << (hash_shift - 12))) & (table->hash_size - 1);
}
-static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
+static int insert_revoke_hash(journal_t *journal, sector_t blocknr,
tid_t seq)
{
struct list_head *hash_list;
@@ -146,7 +147,7 @@ oom:
/* Find a revoke record in the journal's hash table. */
static struct jbd_revoke_record_s *find_revoke_record(journal_t *journal,
- unsigned long blocknr)
+ sector_t blocknr)
{
struct list_head *hash_list;
struct jbd_revoke_record_s *record;
@@ -325,7 +326,7 @@ void journal_destroy_revoke(journal_t *j
* by one.
*/
-int journal_revoke(handle_t *handle, unsigned long blocknr,
+int journal_revoke(handle_t *handle, sector_t blocknr,
struct buffer_head *bh_in)
{
struct buffer_head *bh = NULL;
@@ -394,7 +395,8 @@ int journal_revoke(handle_t *handle, uns
}
}
- jbd_debug(2, "insert revoke for block %lu, bh_in=%p\n", blocknr, bh_in);
+ jbd_debug(2, "insert revoke for block %llu, bh_in=%p\n",
+ (unsigned long long)blocknr, bh_in);
err = insert_revoke_hash(journal, blocknr,
handle->h_transaction->t_tid);
BUFFER_TRACE(bh_in, "exit");
@@ -649,7 +651,7 @@ static void flush_descriptor(journal_t *
*/
int journal_set_revoke(journal_t *journal,
- unsigned long blocknr,
+ sector_t blocknr,
tid_t sequence)
{
struct jbd_revoke_record_s *record;
@@ -673,7 +675,7 @@ int journal_set_revoke(journal_t *journa
*/
int journal_test_revoke(journal_t *journal,
- unsigned long blocknr,
+ sector_t blocknr,
tid_t sequence)
{
struct jbd_revoke_record_s *record;
diff -puN include/linux/ext3_jbd.h~sector_t-jbd include/linux/ext3_jbd.h
--- linux-2.6.17/include/linux/ext3_jbd.h~sector_t-jbd 2006-06-28 16:47:07.581701450 -0700
+++ linux-2.6.17-ming/include/linux/ext3_jbd.h 2006-06-28 16:47:07.597699614 -0700
@@ -154,7 +154,7 @@ __ext3_journal_forget(const char *where,
static inline int
__ext3_journal_revoke(const char *where, handle_t *handle,
- unsigned long blocknr, struct buffer_head *bh)
+ ext3_fsblk_t blocknr, struct buffer_head *bh)
{
int err = journal_revoke(handle, blocknr, bh);
if (err)
diff -puN include/linux/jbd.h~sector_t-jbd include/linux/jbd.h
--- linux-2.6.17/include/linux/jbd.h~sector_t-jbd 2006-06-28 16:47:07.585700991 -0700
+++ linux-2.6.17-ming/include/linux/jbd.h 2006-06-28 16:47:07.600699270 -0700
@@ -738,7 +738,7 @@ struct journal_s
*/
struct block_device *j_dev;
int j_blocksize;
- unsigned int j_blk_offset;
+ sector_t j_blk_offset;
/*
* Device which holds the client fs. For internal journal this will be
@@ -857,7 +857,7 @@ extern void __journal_clean_data_list(tr
/* Log buffer allocation */
extern struct journal_head * journal_get_descriptor_buffer(journal_t *);
-int journal_next_log_block(journal_t *, unsigned long *);
+int journal_next_log_block(journal_t *, sector_t *);
/* Commit management */
extern void journal_commit_transaction(journal_t *);
@@ -872,7 +872,7 @@ extern int
journal_write_metadata_buffer(transaction_t *transaction,
struct journal_head *jh_in,
struct journal_head **jh_out,
- int blocknr);
+ sector_t blocknr);
/* Transaction locking */
extern void __wait_on_journal (journal_t *);
@@ -920,7 +920,7 @@ extern void journal_unlock_updates (jou
extern journal_t * journal_init_dev(struct block_device *bdev,
struct block_device *fs_dev,
- int start, int len, int bsize);
+ sector_t start, int len, int bsize);
extern journal_t * journal_init_inode (struct inode *);
extern int journal_update_format (journal_t *);
extern int journal_check_used_features
@@ -941,7 +941,7 @@ extern void journal_abort (journ
extern int journal_errno (journal_t *);
extern void journal_ack_err (journal_t *);
extern int journal_clear_err (journal_t *);
-extern int journal_bmap(journal_t *, unsigned long, unsigned long *);
+extern int journal_bmap(journal_t *, unsigned long, sector_t *);
extern int journal_force_commit(journal_t *);
/*
@@ -974,14 +974,13 @@ extern void journal_destroy_revoke_ca
extern int journal_init_revoke_caches(void);
extern void journal_destroy_revoke(journal_t *);
-extern int journal_revoke (handle_t *,
- unsigned long, struct buffer_head *);
+extern int journal_revoke (handle_t *, sector_t, struct buffer_head *);
extern int journal_cancel_revoke(handle_t *, struct journal_head *);
extern void journal_write_revoke_records(journal_t *, transaction_t *);
/* Recovery revoke support */
-extern int journal_set_revoke(journal_t *, unsigned long, tid_t);
-extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
+extern int journal_set_revoke(journal_t *, sector_t, tid_t);
+extern int journal_test_revoke(journal_t *, sector_t, tid_t);
extern void journal_clear_revoke(journal_t *);
extern void journal_brelse_array(struct buffer_head *b[], int n);
extern void journal_switch_revoke_table(journal_t *journal);
_
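One practical consequence of moving block variables to a type whose width depends on configuration is that format strings need care. The sketch below is a userspace illustration only: `demo_sector_t` is a hypothetical stand-in for sector_t and `demo_format_revoke()` is not a kernel function. It shows the usual pattern of casting to `unsigned long long` so that the argument matches `%llu` whatever the underlying width is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for sector_t, whose width depends on kernel configuration. */
typedef unsigned long demo_sector_t;

/* Format a debug message; the cast makes the argument match %llu
 * whether demo_sector_t is 32 or 64 bits wide. */
int demo_format_revoke(char *buf, size_t len, demo_sector_t blocknr)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "insert revoke for block %llu",
			(unsigned long long)blocknr);
}
```

Without the cast, the argument and the conversion specifier can disagree on builds where the block type is narrower than `unsigned long long`.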
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (14 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 11/16]JBD layer in-kernel block variables type fixes Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 3:15 ` H. Peter Anvin
2006-06-30 0:18 ` [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support Mingming Cao
` (3 subsequent siblings)
19 siblings, 1 reply; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
"val >> 32" is undefined if val is a 32-bit value, so this code is
broken if CONFIG_LBD is not set. Make it safe for that case.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
linux-2.6.17-ming/fs/jbd/revoke.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix fs/jbd/revoke.c
--- linux-2.6.17/fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix 2006-06-28 16:47:09.695458913 -0700
+++ linux-2.6.17-ming/fs/jbd/revoke.c 2006-06-28 16:47:09.699458454 -0700
@@ -110,7 +110,7 @@ static inline int hash(journal_t *journa
{
struct jbd_revoke_table_s *table = journal->j_revoke;
int hash_shift = table->hash_shift;
- int hash = (int)block ^ (int)(block >> 32);
+ int hash = (int)block ^ (int)((block >> 31) >> 1);
return ((hash << (hash_shift - 6)) ^
(hash >> 13) ^
_
^ permalink raw reply [flat|nested] 295+ messages in thread* Re: [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code
2006-06-30 0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
@ 2006-06-30 3:15 ` H. Peter Anvin
0 siblings, 0 replies; 295+ messages in thread
From: H. Peter Anvin @ 2006-06-30 3:15 UTC (permalink / raw)
To: cmm; +Cc: linux-kernel, ext2-devel, linux-fsdevel
Mingming Cao wrote:
> "val >> 32" is undefined if val is a 32-bit value, so this code is
> broken if CONFIG_LBD is not set. Make it safe for that case.
>
> Signed-off-by: Stephen Tweedie <sct@redhat.com>
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
>
>
> ---
>
> linux-2.6.17-ming/fs/jbd/revoke.c | 2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix fs/jbd/revoke.c
> --- linux-2.6.17/fs/jbd/revoke.c~jbd-revoke-32bit-shift-fix 2006-06-28 16:47:09.695458913 -0700
> +++ linux-2.6.17-ming/fs/jbd/revoke.c 2006-06-28 16:47:09.699458454 -0700
> @@ -110,7 +110,7 @@ static inline int hash(journal_t *journa
> {
> struct jbd_revoke_table_s *table = journal->j_revoke;
> int hash_shift = table->hash_shift;
> - int hash = (int)block ^ (int)(block >> 32);
> + int hash = (int)block ^ (int)((block >> 31) >> 1);
>
It might be better to code it as:
(int)((u64)block >> 32)
... which gcc can trivially recognize as 0 if block is 32 bits. Not
sure if it can do that with the code above.
-hpa
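
For anyone comparing the two spellings discussed above, here is a minimal
userspace sketch. The demo_* names and typedefs are hypothetical stand-ins
for sector_t with and without CONFIG_LBD, and it uses unsigned types rather
than the kernel's int to keep the conversions well defined. It only shows
that both forms produce the same value at either width; it says nothing
about what code gcc actually generates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for sector_t without and with CONFIG_LBD. */
typedef uint32_t demo_blk32_t;
typedef uint64_t demo_blk64_t;

/* Patch form: each shift count is smaller than the type width, so this
 * stays well defined even when the block number is only 32 bits wide. */
uint32_t demo_fold32_two_step(demo_blk32_t block)
{
	return (uint32_t)block ^ (uint32_t)((block >> 31) >> 1);
}

/* Reply form: widen to 64 bits first, then shift by 32. */
uint32_t demo_fold32_widen(demo_blk32_t block)
{
	return (uint32_t)block ^ (uint32_t)((uint64_t)block >> 32);
}

uint32_t demo_fold64_two_step(demo_blk64_t block)
{
	return (uint32_t)block ^ (uint32_t)((block >> 31) >> 1);
}

uint32_t demo_fold64_widen(demo_blk64_t block)
{
	return (uint32_t)block ^ (uint32_t)(block >> 32);
}

/* Both forms should agree for either width. */
void demo_fold_selfcheck(void)
{
	assert(demo_fold32_two_step(0xdeadbeefu) ==
	       demo_fold32_widen(0xdeadbeefu));
	assert(demo_fold64_two_step(0x0000123456789abcULL) ==
	       demo_fold64_widen(0x0000123456789abcULL));
}
```

With a 32-bit block number the high half folds to zero in both forms, so the
hash input is just the block number itself.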
^ permalink raw reply [flat|nested] 295+ messages in thread
* [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (15 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 12/16]Fix undefined ">> 32" in revoke code Mingming Cao
@ 2006-06-30 0:18 ` Mingming Cao
2006-06-30 0:19 ` [RFC][Update][Patch 14/16] 48bit super block (metadata) changes Mingming Cao
` (2 subsequent siblings)
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:18 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
As we are planning to support 48-bit block numbers for ext3,
we also need to support 48-bit block numbers for extended attributes.
In the short term, we can do this by reusing the on-disk 16-bit
padding (linux2.i_pad1, currently used only by "hurd") as the
high-order bits of the xattr block number. This patch does that.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
---
linux-2.6.17-ming/fs/ext3/inode.c | 8 ++++++++
linux-2.6.17-ming/include/linux/ext3_fs.h | 6 ++++--
2 files changed, 12 insertions(+), 2 deletions(-)
diff -puN fs/ext3/inode.c~ext3_48bit_i_file_acl fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~ext3_48bit_i_file_acl 2006-06-28 16:47:11.921203527 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c 2006-06-28 16:47:11.932202265 -0700
@@ -2641,6 +2641,10 @@ void ext3_read_inode(struct inode * inod
ei->i_frag_size = raw_inode->i_fsize;
#endif
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
+ if ((sizeof(sector_t) > 4) &&
+ (EXT3_SB(inode->i_sb)->s_es->s_creator_os != EXT3_OS_HURD))
+ ei->i_file_acl |=
+ ((__u64)le16_to_cpu(raw_inode->i_file_acl_high)) << 32;
if (!S_ISREG(inode->i_mode)) {
ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
} else {
@@ -2774,6 +2778,10 @@ static int ext3_do_update_inode(handle_t
raw_inode->i_frag = ei->i_frag_no;
raw_inode->i_fsize = ei->i_frag_size;
#endif
+ if ((sizeof(sector_t) > 4) &&
+ (EXT3_SB(inode->i_sb)->s_es->s_creator_os != EXT3_OS_HURD))
+ raw_inode->i_file_acl_high =
+ cpu_to_le16((__u64)ei->i_file_acl >> 32);
raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
if (!S_ISREG(inode->i_mode)) {
raw_inode->i_dir_acl = cpu_to_le32(ei->i_dir_acl);
diff -puN include/linux/ext3_fs.h~ext3_48bit_i_file_acl include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3_48bit_i_file_acl 2006-06-28 16:47:11.925203068 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 16:47:11.934202036 -0700
@@ -285,7 +285,7 @@ struct ext3_inode {
struct {
__u8 l_i_frag; /* Fragment number */
__u8 l_i_fsize; /* Fragment size */
- __u16 i_pad1;
+ __u16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
__u32 l_i_reserved2;
@@ -301,7 +301,7 @@ struct ext3_inode {
struct {
__u8 m_i_frag; /* Fragment number */
__u8 m_i_fsize; /* Fragment size */
- __u16 m_pad1;
+ __u16 m_i_file_acl_high;
__u32 m_i_reserved2[2];
} masix2;
} osd2; /* OS dependent 2 */
@@ -315,6 +315,7 @@ struct ext3_inode {
#define i_reserved1 osd1.linux1.l_i_reserved1
#define i_frag osd2.linux2.l_i_frag
#define i_fsize osd2.linux2.l_i_fsize
+#define i_file_acl_high osd2.linux2.l_i_file_acl_high
#define i_uid_low i_uid
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
@@ -335,6 +336,7 @@ struct ext3_inode {
#define i_reserved1 osd1.masix1.m_i_reserved1
#define i_frag osd2.masix2.m_i_frag
#define i_fsize osd2.masix2.m_i_fsize
+#define i_file_acl_high osd2.masix2.m_i_file_acl_high
#define i_reserved2 osd2.masix2.m_i_reserved2
#endif /* defined(__KERNEL__) || defined(__linux__) */
_
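
The split of the xattr block number in the patch above can be sketched in
userspace roughly as follows. The struct and the demo_* helpers are
hypothetical simplifications: they hold the fields in host byte order
(the kernel code converts with le32_to_cpu/le16_to_cpu) and leave out the
sizeof(sector_t) and HURD checks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the two on-disk inode fields. */
struct demo_raw_acl {
	uint32_t i_file_acl;      /* low 32 bits of the xattr block number */
	uint16_t i_file_acl_high; /* high 16 bits, reusing the former pad */
};

/* Combine the two fields into one 48-bit block number. */
uint64_t demo_acl_load(const struct demo_raw_acl *raw)
{
	return (uint64_t)raw->i_file_acl |
	       ((uint64_t)raw->i_file_acl_high << 32);
}

/* Split a block number back into the two fields. */
void demo_acl_store(struct demo_raw_acl *raw, uint64_t block)
{
	assert(block < (1ULL << 48)); /* only 48 bits fit on disk */
	raw->i_file_acl = (uint32_t)block;
	raw->i_file_acl_high = (uint16_t)(block >> 32);
}
```

Storing 0x123456789abc this way puts 0x56789abc in the low field and 0x1234
in the high field, and loading returns the original value.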
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 14/16] 48bit super block (metadata) changes
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (16 preceding siblings ...)
2006-06-30 0:18 ` [RFC][Update][Patch 13/16] 48 bit on-disk i_file_acl support Mingming Cao
@ 2006-06-30 0:19 ` Mingming Cao
2006-06-30 0:19 ` [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature Mingming Cao
2006-06-30 0:19 ` [RFC][Update][Patch 16/16]Update ext3 superblock definition Mingming Cao
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:19 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
In-kernel and on-disk super block changes to support >32-bit block numbers.
Signed-Off-By: Laurent Vivier <Laurent.Vivier@bull.net>
---
linux-2.6.17-ming/fs/ext3/balloc.c | 52 ++++++++++++------
linux-2.6.17-ming/fs/ext3/ialloc.c | 10 ++-
linux-2.6.17-ming/fs/ext3/inode.c | 9 ++-
linux-2.6.17-ming/fs/ext3/resize.c | 25 +++++----
linux-2.6.17-ming/fs/ext3/super.c | 50 ++++++++++--------
linux-2.6.17-ming/include/linux/ext3_fs.h | 83 +++++++++++++++++++++++++++++-
6 files changed, 169 insertions(+), 60 deletions(-)
diff -puN fs/ext3/balloc.c~64bit-metadata fs/ext3/balloc.c
--- linux-2.6.17/fs/ext3/balloc.c~64bit-metadata 2006-06-28 16:47:14.234938045 -0700
+++ linux-2.6.17-ming/fs/ext3/balloc.c 2006-06-28 16:47:14.257935406 -0700
@@ -88,12 +88,16 @@ read_block_bitmap(struct super_block *sb
desc = ext3_get_group_desc (sb, block_group, NULL);
if (!desc)
goto error_out;
- bh = sb_bread(sb, le32_to_cpu(desc->bg_block_bitmap));
+ bh = sb_bread(sb,
+ EXT3_BLOCK_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)));
if (!bh)
ext3_error (sb, "read_block_bitmap",
"Cannot read block bitmap - "
- "block_group = %d, block_bitmap = %u",
- block_group, le32_to_cpu(desc->bg_block_bitmap));
+ "block_group = %d, block_bitmap = "E3FSBLK,
+ block_group,
+ EXT3_BLOCK_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)));
error_out:
return bh;
}
@@ -328,7 +332,7 @@ void ext3_free_blocks_sb(handle_t *handl
es = sbi->s_es;
if (block < le32_to_cpu(es->s_first_data_block) ||
block + count < block ||
- block + count > le32_to_cpu(es->s_blocks_count)) {
+ block + count > EXT3_BLOCKS_COUNT(es)) {
ext3_error (sb, "ext3_free_blocks",
"Freeing blocks not in datazone - "
"block = "E3FSBLK", count = %lu", block, count);
@@ -356,11 +360,19 @@ do_more:
if (!desc)
goto error_return;
- if (in_range (le32_to_cpu(desc->bg_block_bitmap), block, count) ||
- in_range (le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
- in_range (block, le32_to_cpu(desc->bg_inode_table),
+ if (in_range (EXT3_BLOCK_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)),
+ block, count) ||
+ in_range (EXT3_INODE_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)),
+ block, count) ||
+ in_range (block,
+ EXT3_INODE_TABLE(desc,
+ ext3_group_first_block_no(sb, block_group)),
sbi->s_itb_per_group) ||
- in_range (block + count - 1, le32_to_cpu(desc->bg_inode_table),
+ in_range (block + count - 1,
+ EXT3_INODE_TABLE(desc,
+ ext3_group_first_block_no(sb, block_group)),
sbi->s_itb_per_group))
ext3_error (sb, "ext3_free_blocks",
"Freeing blocks in system zones - "
@@ -1163,7 +1175,7 @@ static int ext3_has_free_blocks(struct e
ext3_fsblk_t free_blocks, root_blocks;
free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
+ root_blocks = EXT3_R_BLOCKS_COUNT(sbi->s_es);
if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
sbi->s_resuid != current->fsuid &&
(sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
@@ -1262,7 +1274,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *h
* First, test whether the goal block is free.
*/
if (goal < le32_to_cpu(es->s_first_data_block) ||
- goal >= le32_to_cpu(es->s_blocks_count))
+ goal >= EXT3_BLOCKS_COUNT(es))
goal = le32_to_cpu(es->s_first_data_block);
ext3_get_group_no_and_offset(sb, goal, &group_no, &grp_target_blk);
gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
@@ -1361,11 +1373,15 @@ allocated:
ret_block = grp_alloc_blk + ext3_group_first_block_no(sb, group_no);
- if (in_range(le32_to_cpu(gdp->bg_block_bitmap), ret_block, num) ||
- in_range(le32_to_cpu(gdp->bg_inode_bitmap), ret_block, num) ||
- in_range(ret_block, le32_to_cpu(gdp->bg_inode_table),
+ if (in_range(EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, group_no)),
+ ret_block, num) ||
+ in_range(EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, group_no)),
+ ret_block, num) ||
+ in_range(ret_block, EXT3_INODE_TABLE(gdp,
+ ext3_group_first_block_no(sb, group_no)),
EXT3_SB(sb)->s_itb_per_group) ||
- in_range(ret_block + num - 1, le32_to_cpu(gdp->bg_inode_table),
+ in_range(ret_block + num - 1, EXT3_INODE_TABLE(gdp,
+ ext3_group_first_block_no(sb, group_no)),
EXT3_SB(sb)->s_itb_per_group))
ext3_error(sb, "ext3_new_block",
"Allocating block in system zone - "
@@ -1404,11 +1420,11 @@ allocated:
jbd_unlock_bh_state(bitmap_bh);
#endif
- if (ret_block + num - 1 >= le32_to_cpu(es->s_blocks_count)) {
+ if (ret_block + num - 1 >= EXT3_BLOCKS_COUNT(es)) {
ext3_error(sb, "ext3_new_block",
- "block("E3FSBLK") >= blocks count(%d) - "
+ "block("E3FSBLK") >= blocks count("E3FSBLK") - "
"block_group = %lu, es == %p ", ret_block,
- le32_to_cpu(es->s_blocks_count), group_no, es);
+ EXT3_BLOCKS_COUNT(es), group_no, es);
goto out;
}
@@ -1501,7 +1517,7 @@ ext3_fsblk_t ext3_count_free_blocks(stru
brelse(bitmap_bh);
printk("ext3_count_free_blocks: stored = "E3FSBLK
", computed = "E3FSBLK", "E3FSBLK"\n",
- le32_to_cpu(es->s_free_blocks_count),
+ EXT3_FREE_BLOCKS_COUNT(es),
desc_count, bitmap_count);
return bitmap_count;
#else
diff -puN fs/ext3/ialloc.c~64bit-metadata fs/ext3/ialloc.c
--- linux-2.6.17/fs/ext3/ialloc.c~64bit-metadata 2006-06-28 16:47:14.237937700 -0700
+++ linux-2.6.17-ming/fs/ext3/ialloc.c 2006-06-28 16:47:14.259935176 -0700
@@ -60,12 +60,14 @@ read_inode_bitmap(struct super_block * s
if (!desc)
goto error_out;
- bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
+ bh = sb_bread(sb, EXT3_INODE_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)));
if (!bh)
ext3_error(sb, "read_inode_bitmap",
"Cannot read inode bitmap - "
- "block_group = %lu, inode_bitmap = %u",
- block_group, le32_to_cpu(desc->bg_inode_bitmap));
+ "block_group = %lu, inode_bitmap = %llu",
+ block_group, EXT3_INODE_BITMAP(desc,
+ ext3_group_first_block_no(sb, block_group)));
error_out:
return bh;
}
@@ -304,7 +306,7 @@ static int find_group_orlov(struct super
goto fallback;
}
- blocks_per_dir = le32_to_cpu(es->s_blocks_count) - freeb;
+ blocks_per_dir = EXT3_BLOCKS_COUNT(es) - freeb;
sector_div(blocks_per_dir, ndirs);
max_dirs = ndirs / ngroups + inodes_per_group / 16;
diff -puN fs/ext3/inode.c~64bit-metadata fs/ext3/inode.c
--- linux-2.6.17/fs/ext3/inode.c~64bit-metadata 2006-06-28 16:47:14.241937242 -0700
+++ linux-2.6.17-ming/fs/ext3/inode.c 2006-06-28 16:47:14.263934718 -0700
@@ -2433,8 +2433,9 @@ static ext3_fsblk_t ext3_get_inode_block
*/
offset = ((ino - 1) % EXT3_INODES_PER_GROUP(sb)) *
EXT3_INODE_SIZE(sb);
- block = le32_to_cpu(gdp[desc].bg_inode_table) +
- (offset >> EXT3_BLOCK_SIZE_BITS(sb));
+ block = EXT3_INODE_TABLE((gdp+desc),
+ ext3_group_first_block_no(sb, block_group)) +
+ (offset >> EXT3_BLOCK_SIZE_BITS(sb));
iloc->block_group = block_group;
iloc->offset = offset & (EXT3_BLOCK_SIZE(sb) - 1);
@@ -2501,7 +2502,9 @@ static int __ext3_get_inode_loc(struct i
goto make_io;
bitmap_bh = sb_getblk(inode->i_sb,
- le32_to_cpu(desc->bg_inode_bitmap));
+ EXT3_INODE_BITMAP(desc,
+ ext3_group_first_block_no(inode->i_sb,
+ block_group)));
if (!bitmap_bh)
goto make_io;
diff -puN fs/ext3/resize.c~64bit-metadata fs/ext3/resize.c
--- linux-2.6.17/fs/ext3/resize.c~64bit-metadata 2006-06-28 16:47:14.245936783 -0700
+++ linux-2.6.17-ming/fs/ext3/resize.c 2006-06-28 16:47:14.266934373 -0700
@@ -27,7 +27,7 @@ static int verify_group_input(struct sup
{
struct ext3_sb_info *sbi = EXT3_SB(sb);
struct ext3_super_block *es = sbi->s_es;
- ext3_fsblk_t start = le32_to_cpu(es->s_blocks_count);
+ ext3_fsblk_t start = EXT3_BLOCKS_COUNT(es);
ext3_fsblk_t end = start + input->blocks_count;
unsigned group = input->group;
ext3_fsblk_t itend = input->inode_table + sbi->s_itb_per_group;
@@ -817,9 +817,12 @@ int ext3_group_add(struct super_block *s
/* Update group descriptor block for new group */
gdp = (struct ext3_group_desc *)primary->b_data + gdb_off;
- gdp->bg_block_bitmap = cpu_to_le32(input->block_bitmap);
- gdp->bg_inode_bitmap = cpu_to_le32(input->inode_bitmap);
- gdp->bg_inode_table = cpu_to_le32(input->inode_table);
+ EXT3_BLOCK_BITMAP_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+ input->block_bitmap); /* LV FIXME */
+ EXT3_INODE_BITMAP_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+ input->inode_bitmap); /* LV FIXME */
+ EXT3_INODE_TABLE_SET(gdp, ext3_group_first_block_no(sb, gdb_num),
+ input->inode_table); /* LV FIXME */
gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
gdp->bg_free_inodes_count = cpu_to_le16(EXT3_INODES_PER_GROUP(sb));
@@ -833,7 +836,7 @@ int ext3_group_add(struct super_block *s
* blocks/inodes before the group is live won't actually let us
* allocate the new space yet.
*/
- es->s_blocks_count = cpu_to_le32(le32_to_cpu(es->s_blocks_count) +
+ EXT3_BLOCKS_COUNT_SET(es, EXT3_BLOCKS_COUNT(es) +
input->blocks_count);
es->s_inodes_count = cpu_to_le32(le32_to_cpu(es->s_inodes_count) +
EXT3_INODES_PER_GROUP(sb));
@@ -869,7 +872,7 @@ int ext3_group_add(struct super_block *s
/* Update the reserved block counts only once the new group is
* active. */
- es->s_r_blocks_count = cpu_to_le32(le32_to_cpu(es->s_r_blocks_count) +
+ EXT3_R_BLOCKS_COUNT_SET(es, EXT3_R_BLOCKS_COUNT(es) +
input->reserved_blocks);
/* Update the free space counts */
@@ -920,7 +923,7 @@ int ext3_group_extend(struct super_block
/* We don't need to worry about locking wrt other resizers just
* yet: we're going to revalidate es->s_blocks_count after
* taking lock_super() below. */
- o_blocks_count = le32_to_cpu(es->s_blocks_count);
+ o_blocks_count = EXT3_BLOCKS_COUNT(es);
o_groups_count = EXT3_SB(sb)->s_groups_count;
if (test_opt(sb, DEBUG))
@@ -986,7 +989,7 @@ int ext3_group_extend(struct super_block
}
lock_super(sb);
- if (o_blocks_count != le32_to_cpu(es->s_blocks_count)) {
+ if (o_blocks_count != EXT3_BLOCKS_COUNT(es)) {
ext3_warning(sb, __FUNCTION__,
"multiple resizers run on filesystem!");
unlock_super(sb);
@@ -1002,7 +1005,7 @@ int ext3_group_extend(struct super_block
ext3_journal_stop(handle);
goto exit_put;
}
- es->s_blocks_count = cpu_to_le32(o_blocks_count + add);
+ EXT3_BLOCKS_COUNT_SET(es, o_blocks_count + add);
ext3_journal_dirty_metadata(handle, EXT3_SB(sb)->s_sbh);
sb->s_dirt = 1;
unlock_super(sb);
@@ -1014,8 +1017,8 @@ int ext3_group_extend(struct super_block
if ((err = ext3_journal_stop(handle)))
goto exit_put;
if (test_opt(sb, DEBUG))
- printk(KERN_DEBUG "EXT3-fs: extended group to %u blocks\n",
- le32_to_cpu(es->s_blocks_count));
+ printk(KERN_DEBUG "EXT3-fs: extended group to %llu blocks\n",
+ EXT3_BLOCKS_COUNT(es));
update_backups(sb, EXT3_SB(sb)->s_sbh->b_blocknr, (char *)es,
sizeof(struct ext3_super_block));
exit_put:
diff -puN fs/ext3/super.c~64bit-metadata fs/ext3/super.c
--- linux-2.6.17/fs/ext3/super.c~64bit-metadata 2006-06-28 16:47:14.248936438 -0700
+++ linux-2.6.17-ming/fs/ext3/super.c 2006-06-28 16:47:14.270933914 -0700
@@ -1151,44 +1151,48 @@ static int ext3_check_descriptors (struc
if ((i % EXT3_DESC_PER_BLOCK(sb)) == 0)
gdp = (struct ext3_group_desc *)
sbi->s_group_desc[desc_block++]->b_data;
- if (le32_to_cpu(gdp->bg_block_bitmap) < block ||
- le32_to_cpu(gdp->bg_block_bitmap) >=
+ if (EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)) <
+ block ||
+ EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)) >=
block + EXT3_BLOCKS_PER_GROUP(sb))
{
ext3_error (sb, "ext3_check_descriptors",
"Block bitmap for group %d"
" not in group (block %lu)!",
i, (unsigned long)
- le32_to_cpu(gdp->bg_block_bitmap));
+ EXT3_BLOCK_BITMAP(gdp, ext3_group_first_block_no(sb, i)));
return 0;
}
- if (le32_to_cpu(gdp->bg_inode_bitmap) < block ||
- le32_to_cpu(gdp->bg_inode_bitmap) >=
+ if (EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)) <
+ block ||
+ EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)) >=
block + EXT3_BLOCKS_PER_GROUP(sb))
{
ext3_error (sb, "ext3_check_descriptors",
"Inode bitmap for group %d"
" not in group (block %lu)!",
i, (unsigned long)
- le32_to_cpu(gdp->bg_inode_bitmap));
+ EXT3_INODE_BITMAP(gdp, ext3_group_first_block_no(sb, i)));
return 0;
}
- if (le32_to_cpu(gdp->bg_inode_table) < block ||
- le32_to_cpu(gdp->bg_inode_table) + sbi->s_itb_per_group >=
- block + EXT3_BLOCKS_PER_GROUP(sb))
+ if (EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)) <
+ block ||
+ EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)) +
+ sbi->s_itb_per_group >=
+ block + EXT3_BLOCKS_PER_GROUP(sb))
{
ext3_error (sb, "ext3_check_descriptors",
"Inode table for group %d"
" not in group (block %lu)!",
i, (unsigned long)
- le32_to_cpu(gdp->bg_inode_table));
+ EXT3_INODE_TABLE(gdp, ext3_group_first_block_no(sb, i)));
return 0;
}
block += EXT3_BLOCKS_PER_GROUP(sb);
gdp++;
}
- sbi->s_es->s_free_blocks_count=cpu_to_le32(ext3_count_free_blocks(sb));
+ EXT3_FREE_BLOCKS_COUNT_SET(sbi->s_es, ext3_count_free_blocks(sb));
sbi->s_es->s_free_inodes_count=cpu_to_le32(ext3_count_free_inodes(sb));
return 1;
}
@@ -1365,6 +1369,7 @@ static int ext3_fill_super (struct super
int i;
int needs_recovery;
__le32 features;
+ __u64 blocks_count;
sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
if (!sbi)
@@ -1575,7 +1580,7 @@ static int ext3_fill_super (struct super
goto failed_mount;
}
- if (le32_to_cpu(es->s_blocks_count) >
+ if (EXT3_BLOCKS_COUNT(es) >
(sector_t)(~0ULL) >> (sb->s_blocksize_bits - 9)) {
printk(KERN_ERR "EXT3-fs: filesystem on %s:"
" too large to mount safely\n", sb->s_id);
@@ -1587,10 +1592,11 @@ static int ext3_fill_super (struct super
if (EXT3_BLOCKS_PER_GROUP(sb) == 0)
goto cantfind_ext3;
- sbi->s_groups_count = (le32_to_cpu(es->s_blocks_count) -
- le32_to_cpu(es->s_first_data_block) +
- EXT3_BLOCKS_PER_GROUP(sb) - 1) /
- EXT3_BLOCKS_PER_GROUP(sb);
+ blocks_count = (EXT3_BLOCKS_COUNT(es) -
+ le32_to_cpu(es->s_first_data_block) +
+ EXT3_BLOCKS_PER_GROUP(sb) - 1);
+ do_div(blocks_count, EXT3_BLOCKS_PER_GROUP(sb));
+ sbi->s_groups_count = blocks_count;
db_count = (sbi->s_groups_count + EXT3_DESC_PER_BLOCK(sb) - 1) /
EXT3_DESC_PER_BLOCK(sb);
sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
@@ -1904,7 +1910,7 @@ static journal_t *ext3_get_dev_journal(s
goto out_bdev;
}
- len = le32_to_cpu(es->s_blocks_count);
+ len = EXT3_BLOCKS_COUNT(es);
start = sb_block + 1;
brelse(bh); /* we're done with the superblock */
@@ -2074,7 +2080,7 @@ static void ext3_commit_super (struct su
if (!sbh)
return;
es->s_wtime = cpu_to_le32(get_seconds());
- es->s_free_blocks_count = cpu_to_le32(ext3_count_free_blocks(sb));
+ EXT3_FREE_BLOCKS_COUNT_SET(es, ext3_count_free_blocks(sb));
es->s_free_inodes_count = cpu_to_le32(ext3_count_free_inodes(sb));
BUFFER_TRACE(sbh, "marking dirty");
mark_buffer_dirty(sbh);
@@ -2267,7 +2273,7 @@ static int ext3_remount (struct super_bl
ext3_init_journal_params(sb, sbi->s_journal);
if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY) ||
- n_blocks_count > le32_to_cpu(es->s_blocks_count)) {
+ n_blocks_count > EXT3_BLOCKS_COUNT(es)) {
if (sbi->s_mount_opt & EXT3_MOUNT_ABORT) {
err = -EROFS;
goto restore_opts;
@@ -2388,10 +2394,10 @@ static int ext3_statfs (struct dentry *
buf->f_type = EXT3_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
- buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
+ buf->f_blocks = EXT3_BLOCKS_COUNT(es) - overhead;
buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
- buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
- if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
+ buf->f_bavail = buf->f_bfree - EXT3_R_BLOCKS_COUNT(es);
+ if (buf->f_bfree < EXT3_R_BLOCKS_COUNT(es))
buf->f_bavail = 0;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
diff -puN include/linux/ext3_fs.h~64bit-metadata include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~64bit-metadata 2006-06-28 16:47:14.252935980 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 16:47:14.273933570 -0700
@@ -136,6 +136,54 @@ struct ext3_group_desc
__le32 bg_reserved[3];
};
+static inline u32 EXT3_RELATIVE_ENCODE(ext3_fsblk_t group_base,
+ ext3_fsblk_t fs_block)
+{
+ s32 gdp_block;
+
+ if (fs_block < (1ULL<<32) && group_base < (1ULL<<32))
+ return fs_block;
+
+ gdp_block = (fs_block - group_base);
+ BUG_ON ((group_base + gdp_block) != fs_block);
+
+ return gdp_block;
+}
+
+static inline ext3_fsblk_t EXT3_RELATIVE_DECODE(ext3_fsblk_t group_base,
+ u32 gdp_block)
+{
+ if (group_base >= (1ULL<<32))
+ return group_base + (s32) gdp_block;
+
+ if ((s32) gdp_block >= 0 && gdp_block < group_base &&
+ group_base + gdp_block >= (1ULL<<32))
+ return group_base + gdp_block;
+
+ return gdp_block;
+}
+
+#define EXT3_BLOCK_BITMAP(bg, group_base) \
+ EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_block_bitmap))
+#define EXT3_INODE_BITMAP(bg, group_base) \
+ EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_inode_bitmap))
+#define EXT3_INODE_TABLE(bg, group_base) \
+ EXT3_RELATIVE_DECODE(group_base, le32_to_cpu((bg)->bg_inode_table))
+
+#define EXT3_BLOCK_BITMAP_SET(bg, group_base, value) \
+ do {(bg)->bg_block_bitmap = EXT3_RELATIVE_ENCODE(group_base, value);} while(0)
+#define EXT3_INODE_BITMAP_SET(bg, group_base, value) \
+ do {(bg)->bg_inode_bitmap = EXT3_RELATIVE_ENCODE(group_base, value);} while(0)
+#define EXT3_INODE_TABLE_SET(bg, group_base, value) \
+ do {(bg)->bg_inode_table = EXT3_RELATIVE_ENCODE(group_base, value);} while(0)
+
+#define EXT3_IS_USED_BLOCK_BITMAP(bg) \
+ ((bg)->bg_block_bitmap != 0)
+#define EXT3_IS_USED_INODE_BITMAP(bg) \
+ ((bg)->bg_inode_bitmap != 0)
+#define EXT3_IS_USED_INODE_TABLE(bg) \
+ ((bg)->bg_inode_table != 0)
+
/*
* Macro-instructions used to manage group descriptors
*/
@@ -483,9 +531,38 @@ struct ext3_super_block {
__u16 s_reserved_word_pad;
__le32 s_default_mount_opts;
__le32 s_first_meta_bg; /* First metablock block group */
- __u32 s_reserved[190]; /* Padding to the end of the block */
+ /* 64bit support valid if EXT3_FEATURE_COMPAT_64BIT */
+ __le32 s_blocks_count_hi; /* Blocks count */
+/*100*/ __le32 s_r_blocks_count_hi; /* Reserved blocks count */
+ __le32 s_free_blocks_count_hi; /* Free blocks count */
+ __u32 s_reserved[187]; /* Padding to the end of the block */
};
+
+#define EXT3_BLOCKS_COUNT(s) \
+ (ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_blocks_count_hi) << 32) | \
+ (__u64)le32_to_cpu((s)->s_blocks_count))
+#define EXT3_BLOCKS_COUNT_SET(s,v) do { \
+ (s)->s_blocks_count = cpu_to_le32((v)); \
+ (s)->s_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32); \
+} while (0)
+
+#define EXT3_R_BLOCKS_COUNT(s) \
+ (ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_r_blocks_count_hi) << 32) | \
+ (__u64)le32_to_cpu((s)->s_r_blocks_count))
+#define EXT3_R_BLOCKS_COUNT_SET(s,v) do { \
+ (s)->s_r_blocks_count = cpu_to_le32((v)); \
+ (s)->s_r_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32); \
+} while (0)
+
+#define EXT3_FREE_BLOCKS_COUNT(s) \
+ (ext3_fsblk_t)(((__u64)le32_to_cpu((s)->s_free_blocks_count_hi) << 32) | \
+ (__u64)le32_to_cpu((s)->s_free_blocks_count))
+#define EXT3_FREE_BLOCKS_COUNT_SET(s,v) do { \
+ (s)->s_free_blocks_count = cpu_to_le32((v)); \
+ (s)->s_free_blocks_count_hi = cpu_to_le32(((__u64)(v)) >> 32); \
+} while (0)
+
#ifdef __KERNEL__
#include <linux/ext3_fs_i.h>
#include <linux/ext3_fs_sb.h>
@@ -559,6 +636,7 @@ static inline struct ext3_inode_info *EX
#define EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT3_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
#define EXT3_FEATURE_RO_COMPAT_BTREE_DIR 0x0004
+#define EXT3_FEATURE_RO_COMPAT_64BIT 0x0010
#define EXT3_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT3_FEATURE_INCOMPAT_FILETYPE 0x0002
@@ -574,7 +652,8 @@ static inline struct ext3_inode_info *EX
EXT3_FEATURE_INCOMPAT_EXTENTS)
#define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
- EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
+ EXT3_FEATURE_RO_COMPAT_BTREE_DIR| \
+ EXT3_FEATURE_RO_COMPAT_64BIT)
/*
* Default values for user and/or group using reserved blocks
_
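
To make the group-relative encoding in the ext3_fs.h hunk above easier to
experiment with, here is a userspace sketch of the same logic. The demo_*
names are hypothetical, ext3_fsblk_t is assumed to be a 64-bit unsigned
type, and BUG_ON is replaced by assert. The narrowing to int32_t relies on
the usual two's-complement behaviour of gcc and clang.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_4G (1ULL << 32)

/* Values below 2^32 are stored as absolute block numbers; anything else
 * is stored as a signed 32-bit offset from the group's first block. */
uint32_t demo_relative_encode(uint64_t group_base, uint64_t fs_block)
{
	int32_t gdp_block;

	if (fs_block < DEMO_4G && group_base < DEMO_4G)
		return (uint32_t)fs_block;

	gdp_block = (int32_t)(uint32_t)(fs_block - group_base);
	/* The offset must be representable in 32 signed bits. */
	assert(group_base + (uint64_t)(int64_t)gdp_block == fs_block);

	return (uint32_t)gdp_block;
}

uint64_t demo_relative_decode(uint64_t group_base, uint32_t gdp_block)
{
	/* Group entirely above 2^32: the stored value is a signed offset. */
	if (group_base >= DEMO_4G)
		return group_base + (uint64_t)(int64_t)(int32_t)gdp_block;

	/* Group straddling 2^32: a small positive value that would land
	 * past 2^32 when added to the base is treated as an offset. */
	if ((int32_t)gdp_block >= 0 && gdp_block < group_base &&
	    group_base + gdp_block >= DEMO_4G)
		return group_base + gdp_block;

	/* Otherwise the stored value is an absolute block number. */
	return gdp_block;
}
```

A round trip through both helpers should give back the original block for
groups below, above, and straddling the 2^32 boundary, including blocks
that sit slightly before the group base.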
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (17 preceding siblings ...)
2006-06-30 0:19 ` [RFC][Update][Patch 14/16] 48bit super block (metadata) changes Mingming Cao
@ 2006-06-30 0:19 ` Mingming Cao
2006-06-30 0:19 ` [RFC][Update][Patch 16/16]Update ext3 superblock definition Mingming Cao
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:19 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
Change the 64bit feature to an INCOMPAT feature, and fix a compile warning in the 64bit-metadata patch.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
linux-2.6.17-ming/include/linux/ext3_fs.h | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)
diff -puN include/linux/ext3_fs.h~64bit-incompat-flag-change include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~64bit-incompat-flag-change 2006-06-28 16:47:16.224709734 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 16:49:17.471797610 -0700
@@ -136,6 +136,9 @@ struct ext3_group_desc
__le32 bg_reserved[3];
};
+#ifdef __KERNEL__
+#include <linux/ext3_fs_i.h>
+#include <linux/ext3_fs_sb.h>
static inline u32 EXT3_RELATIVE_ENCODE(ext3_fsblk_t group_base,
ext3_fsblk_t fs_block)
{
@@ -183,7 +186,7 @@ static inline ext3_fsblk_t EXT3_RELATIVE
((bg)->bg_inode_bitmap != 0)
#define EXT3_IS_USED_INODE_TABLE(bg) \
((bg)->bg_inode_table != 0)
-
+#endif
/*
* Macro-instructions used to manage group descriptors
*/
@@ -564,8 +567,6 @@ struct ext3_super_block {
} while (0)
#ifdef __KERNEL__
-#include <linux/ext3_fs_i.h>
-#include <linux/ext3_fs_sb.h>
static inline struct ext3_sb_info * EXT3_SB(struct super_block *sb)
{
return sb->s_fs_info;
@@ -636,7 +637,6 @@ static inline struct ext3_inode_info *EX
#define EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT3_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
#define EXT3_FEATURE_RO_COMPAT_BTREE_DIR 0x0004
-#define EXT3_FEATURE_RO_COMPAT_64BIT 0x0010
#define EXT3_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT3_FEATURE_INCOMPAT_FILETYPE 0x0002
@@ -644,16 +644,17 @@ static inline struct ext3_inode_info *EX
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
#define EXT3_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS 0x0040 /* extents support */
+#define EXT3_FEATURE_INCOMPAT_64BIT 0x0080
#define EXT3_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT3_FEATURE_INCOMPAT_SUPP (EXT3_FEATURE_INCOMPAT_FILETYPE| \
EXT3_FEATURE_INCOMPAT_RECOVER| \
EXT3_FEATURE_INCOMPAT_META_BG| \
- EXT3_FEATURE_INCOMPAT_EXTENTS)
+ EXT3_FEATURE_INCOMPAT_EXTENTS| \
+ EXT3_FEATURE_INCOMPAT_64BIT)
#define EXT3_FEATURE_RO_COMPAT_SUPP (EXT3_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT3_FEATURE_RO_COMPAT_LARGE_FILE| \
- EXT3_FEATURE_RO_COMPAT_BTREE_DIR| \
- EXT3_FEATURE_RO_COMPAT_64BIT)
+ EXT3_FEATURE_RO_COMPAT_BTREE_DIR)
/*
* Default values for user and/or group using reserved blocks
_
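
The practical difference between the two flag classes can be sketched as
below. This is an illustration of the general ext2/ext3 feature-mask
convention, not code from the patch: the DEMO_* names and
demo_classify_mount() are hypothetical, the flag values are copied from the
hunks above, and INCOMPAT_RECOVER is left out because its value does not
appear in them.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INCOMPAT_FILETYPE 0x0002u
#define DEMO_INCOMPAT_META_BG  0x0010u
#define DEMO_INCOMPAT_EXTENTS  0x0040u
#define DEMO_INCOMPAT_64BIT    0x0080u
#define DEMO_RO_COMPAT_64BIT   0x0010u /* value used before this patch */

enum demo_mount_mode { DEMO_REFUSE, DEMO_READ_ONLY, DEMO_READ_WRITE };

/* Unknown INCOMPAT bits refuse the mount outright; unknown RO_COMPAT
 * bits still permit a read-only mount. */
enum demo_mount_mode demo_classify_mount(uint32_t fs_incompat,
					 uint32_t fs_ro_compat,
					 uint32_t supp_incompat,
					 uint32_t supp_ro_compat)
{
	if (fs_incompat & ~supp_incompat)
		return DEMO_REFUSE;
	if (fs_ro_compat & ~supp_ro_compat)
		return DEMO_READ_ONLY;
	return DEMO_READ_WRITE;
}

/* A kernel without 64bit support, modelled with a subset of its masks. */
void demo_feature_selfcheck(void)
{
	uint32_t old_incompat = DEMO_INCOMPAT_FILETYPE |
				DEMO_INCOMPAT_META_BG |
				DEMO_INCOMPAT_EXTENTS;
	uint32_t old_ro = 0x0007u;

	assert(demo_classify_mount(DEMO_INCOMPAT_64BIT, 0,
				   old_incompat, old_ro) == DEMO_REFUSE);
	assert(demo_classify_mount(0, DEMO_RO_COMPAT_64BIT,
				   old_incompat, old_ro) == DEMO_READ_ONLY);
}
```

Under the RO_COMPAT scheme an older kernel would still mount the filesystem
read-only and ignore the high 32 bits of the block counts, which is one
plausible reason for moving the flag to INCOMPAT.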
^ permalink raw reply [flat|nested] 295+ messages in thread* [RFC][Update][Patch 16/16]Update ext3 superblock definition
2006-06-09 1:20 [RFC 0/13] extents and 48bit ext3 Mingming Cao
` (18 preceding siblings ...)
2006-06-30 0:19 ` [RFC][Update][Patch 15/16] compile warning fix and change 64bit to INCOMPAT feature Mingming Cao
@ 2006-06-30 0:19 ` Mingming Cao
19 siblings, 0 replies; 295+ messages in thread
From: Mingming Cao @ 2006-06-30 0:19 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-fsdevel, ext2-devel
The ext3 on-disk superblock definition in the kernel is lagging
behind some e2fsprogs-only fields (the backup of the journal inode,
and the mkfs timestamp), leading to the high bits of the fs size
being declared in a field already reserved by e2fsprogs. Bring them
back in sync.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
---
linux-2.6.17-ming/include/linux/ext3_fs.h | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff -puN include/linux/ext3_fs.h~ext3-sb-struc-sync-with-e2fsprog include/linux/ext3_fs.h
--- linux-2.6.17/include/linux/ext3_fs.h~ext3-sb-struc-sync-with-e2fsprog 2006-06-28 16:47:18.377462723 -0700
+++ linux-2.6.17-ming/include/linux/ext3_fs.h 2006-06-28 16:47:18.381462264 -0700
@@ -532,13 +532,15 @@ struct ext3_super_block {
__u8 s_def_hash_version; /* Default hash version to use */
__u8 s_reserved_char_pad;
__u16 s_reserved_word_pad;
- __le32 s_default_mount_opts;
+/*100*/ __le32 s_default_mount_opts;
__le32 s_first_meta_bg; /* First metablock block group */
+ __le32 s_mkfs_time; /* When the filesystem was created */
+ __le32 s_jnl_blocks[17]; /* Backup of the journal inode */
/* 64bit support valid if EXT3_FEATURE_COMPAT_64BIT */
- __le32 s_blocks_count_hi; /* Blocks count */
-/*100*/ __le32 s_r_blocks_count_hi; /* Reserved blocks count */
+/*150*/ __le32 s_blocks_count_hi; /* Blocks count */
+ __le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */
- __u32 s_reserved[187]; /* Padding to the end of the block */
+ __u32 s_reserved[169]; /* Padding to the end of the block */
};
_
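
The offsets annotated in the hunk above can be cross-checked with a small
compile-time sketch. It assumes that the /*100*/ and /*150*/ comments are
hexadecimal byte offsets, that the superblock is 1024 bytes, and that
everything before s_default_mount_opts occupies exactly 0x100 bytes; those
leading fields are collapsed into an opaque array, and struct demo_sb is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_sb {
	uint8_t  head[0x100];            /* opaque stand-in for earlier fields */
	uint32_t s_default_mount_opts;   /* 0x100 */
	uint32_t s_first_meta_bg;
	uint32_t s_mkfs_time;
	uint32_t s_jnl_blocks[17];
	uint32_t s_blocks_count_hi;      /* 0x150 */
	uint32_t s_r_blocks_count_hi;
	uint32_t s_free_blocks_count_hi;
	uint32_t s_reserved[169];
};

_Static_assert(offsetof(struct demo_sb, s_default_mount_opts) == 0x100,
	       "s_default_mount_opts should sit at 0x100");
_Static_assert(offsetof(struct demo_sb, s_blocks_count_hi) == 0x150,
	       "s_blocks_count_hi should sit at 0x150");
_Static_assert(sizeof(struct demo_sb) == 1024,
	       "superblock should still be 1024 bytes");

/* Bytes consumed by the fields added between 0x100 and s_reserved. */
size_t demo_sb_used_after_0x100(void)
{
	return offsetof(struct demo_sb, s_reserved) - 0x100;
}
```

The 18 new 32-bit words (s_mkfs_time plus s_jnl_blocks[17]) account for the
drop of s_reserved from 187 to 169 entries, so the overall size is
unchanged.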
^ permalink raw reply [flat|nested] 295+ messages in thread