linuxppc-dev.lists.ozlabs.org archive mirror
* Btree directories (Re: Status of HFS+ support)
  2000-08-29 16:40 Status of HFS+ support (was hfs support for blocksize != 512) Halfmann, Klaus
@ 2000-08-29 17:18 ` Matthew Wilcox
  2000-08-29 18:45   ` Steve Lord
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Matthew Wilcox @ 2000-08-29 17:18 UTC (permalink / raw)
  To: Halfmann, Klaus
  Cc: 'Matthew Wilcox', Roman Zippel, linuxppc-dev,
	linux-fsdevel


On Tue, Aug 29, 2000 at 06:40:08PM +0200, Halfmann, Klaus wrote:
> On the other hand: the HFS access is normally not needed for heavy
> concurrent access. (Hmm, there might be AFP servers publishing HFS
> partitions.) For now it should be enough to have a global update
> lock on the root of the node. I'm not that familiar with kernel code yet,
> don't ask me about the details :-)

It's not that easy.  Unfortunately, getdents is a nasty interface (I
actually haven't seen a good one -- anyone know of one which doesn't
suffer from this problem?)

You can lock while you're in a call, but eventually, you fill up the
user's buffer and return.  At that point, you have to drop the lock
because they might never call you again.  So you have to consider
the case of a file being added or removed between calls to getdents.
Current HFS doesn't even pretend to try.  It just stores an index (0
.. n-1) and you pick up from there.  So sometimes you get files twice,
sometimes files don't show up at all.

I suspect you need to store the key of the item you just found, and
then continue filling in the buffer from there on subsequent calls.
The trouble is that there frequently isn't enough space available to
do that.  The proposed ext2 btree extensions needed 64 bits of space and
there was only 32 bits available.  What size keys does HFS+ have?

Hmmm... now that the LFS patches have gone in and f_pos is a 64-bit
quantity, this sounds more plausible.  I'd be curious to hear from the
ReiserFS people how they solved this problem.
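
To make the resume-by-key idea concrete, here is a minimal userspace sketch
(not kernel code; the entries, key values and buffer size are invented for
illustration): the cookie holds the key of the last entry returned, and each
call continues from the first key strictly greater than it, so entries added
or removed between calls are neither duplicated nor skipped.

/* Hedged sketch: resuming a directory scan from a stored 64-bit key
 * instead of an entry index.  The "directory" is just a sorted array. */
#include <stdio.h>
#include <stdint.h>

struct dent { uint64_t key; const char *name; };

/* Sorted by key, as entries in a btree directory would be. */
static struct dent dir[] = {
	{ 10, "alpha" }, { 25, "beta" }, { 40, "gamma" }, { 55, "delta" },
};
static int nents = sizeof(dir) / sizeof(dir[0]);

/* Emit up to 'space' entries, starting at the first key strictly greater
 * than *cookie.  *cookie ends up holding the key of the last entry
 * emitted, so the next call resumes correctly even if entries were
 * inserted or removed in between. */
static int fill_from(uint64_t *cookie, int space)
{
	int i, emitted = 0;

	for (i = 0; i < nents && emitted < space; i++) {
		if (dir[i].key <= *cookie)
			continue;
		printf("%s\n", dir[i].name);
		*cookie = dir[i].key;
		emitted++;
	}
	return emitted;
}

int main(void)
{
	uint64_t cookie = 0;		/* plays the role of f_pos */

	while (fill_from(&cookie, 2) > 0)
		;			/* caller keeps calling "getdents" */
	return 0;
}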

There's an additional complication of rewinddir/seekdir/telldir, but
let's get onto that later.

> I'd really like to know how the concurrency of the kernel filesystem
> should work, can anybody feed me with documentations about that ?

I've cc'd linux-fsdevel, where people can fill you in.


* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-29 17:18 ` Btree directories (Re: Status of HFS+ support) Matthew Wilcox
@ 2000-08-29 18:45   ` Steve Lord
  2000-08-29 21:25     ` Matthew Wilcox
  2000-08-30  2:22   ` flar
  2000-08-30 14:25   ` Chris Mason
  2 siblings, 1 reply; 28+ messages in thread
From: Steve Lord @ 2000-08-29 18:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Halfmann, Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


> On Tue, Aug 29, 2000 at 06:40:08PM +0200, Halfmann, Klaus wrote:
> > On the other hand: the HFS access is normally not needed for heavy
> > concurrent access. (Hmm, there might be AFP servers publishing HFS
> > partitions.) For now it should be enough to have a global update
> > lock on the root of the node. I'm not that familiar with kernel code yet,
> > don't ask me about the details :-)
>
> It's not that easy.  Unfortunately, getdents is a nasty interface (I
> actually haven't seen a good one -- anyone know of one which doesn't
> suffer from this problem?)
>
> You can lock while you're in a call, but eventually, you fill up the
> user's buffer and return.  At that point, you have to drop the lock
> because they might never call you again.  So you have to consider
> the case of a file being added or removed between calls to getdents.
> Current HFS doesn't even pretend to try.  It just stores an index (0
> .. n-1) and you pick up from there.  So sometimes you get files twice,
> sometimes files don't show up at all.
>
> I suspect you need to store the key of the item you just found, and
> then continue filling in the buffer from there on subsequent calls.
> The trouble is that there frequently isn't enough space available to
> do that.  The proposed ext2 btree extensions needed 64 bits of space and
> there was only 32 bits available.  What size keys does HFS+ have?
>
> Hmmm.. now the LFS patches have gone in and f_pos is now a 64-bit quantity,
> this sounds more plausible.  I'd be curious to hear from the ReiserFS
> people how they solved this problem.

f_pos being 64-bit is one thing; go look at the source code for glibc's
getdents. It is not 64-bit friendly - actually it is only 31-bit friendly:
if you return anything bigger than a signed integer it gets confused and
can skip entries, or go into an infinite loop.
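
A toy illustration of the truncation Steve describes (this is not glibc
source; the cookie value is invented): once a directory offset exceeds what a
signed 32-bit integer can hold, it comes out negative on a 31-bit-clean path,
and any comparison or reseek based on it goes wrong.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t d_off = 0x92345678ULL;	/* hypothetical cookie above INT32_MAX */
	int32_t pos = (int32_t)d_off;	/* what a 31-bit-clean path keeps */

	printf("kernel cookie: %llu\n", (unsigned long long)d_off);
	printf("stored as int: %d\n", pos);	/* negative, so offset logic misbehaves */
	return 0;
}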

>
> There's an additional complication of rewinddir/seekdir/telldir, but
> let's get onto that later.

Which glibc's getdents uses: because its dirent is different in size from
the kernel's dirent, it uses a heuristic to decide how much data to ask
the kernel for. If it guesses wrong and gets more data back from the
kernel than will fit in the user's buffer, it seeks backwards again to
reposition for the next call.
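
For reference, that repositioning can be reproduced from userspace with the
standard telldir()/seekdir() calls. A small sketch (it reads "." only so it
runs anywhere): remember a position, read past it, then seek back so the next
readdir() hands out the same entry again.

#include <stdio.h>
#include <dirent.h>

int main(void)
{
	DIR *d = opendir(".");
	struct dirent *de;
	long pos;

	if (!d)
		return 1;
	readdir(d);			/* consume one entry */
	pos = telldir(d);		/* cookie the filesystem must honour */
	de = readdir(d);
	if (de)
		printf("next entry:    %s\n", de->d_name);
	seekdir(d, pos);		/* "give that entry back" */
	de = readdir(d);
	if (de)
		printf("after seekdir: %s\n", de->d_name);
	closedir(d);
	return 0;
}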

Steve Lord



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-29 18:45   ` Steve Lord
@ 2000-08-29 21:25     ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2000-08-29 21:25 UTC (permalink / raw)
  To: Steve Lord
  Cc: Matthew Wilcox, Halfmann, Klaus, Roman Zippel, linuxppc-dev,
	linux-fsdevel


On Tue, Aug 29, 2000 at 01:45:51PM -0500, Steve Lord wrote:
> f_pos being 64 bit is one thing, go look at the source code for glibc
> getdents, it is not 64 bit friendly, actually it is only 31 bit friendly,
> if you return anything bigger than a signed integer it gets confused and
> can skip entries - or go into an infinite loop.

Hmm... glibc 2.2 looks better from this PoV, but I haven't tested it -
is the bug gone?

> > There's an additional complication of rewinddir/seekdir/telldir, but
> > let's get onto that later.
>
> Which glibc getdents uses because its dirent is different in size from
> the kernel's dirent, it uses a heuristic to decide how much data to ask
> the kernel for. If it guesses wrong and gets more data back from the
> kernel than will fit in the user's buffer it seeks backwards again to
> reposition for the next call.

Exactly.  So, as a minimum, we have to be able to support seeking to one
of the keys in the B-tree.  It'd be nice if the search algorithm coped
with searching for an entry which is between two values in the B-tree
and returned the lower one.
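
A minimal sketch of that kind of search over one node's worth of sorted keys
(the flat 64-bit key array is an assumption made for illustration; real code
would do this per node and then descend): a binary search that returns the
index of the greatest key less than or equal to the target, or -1 if the
target sorts before every key.

#include <stdint.h>

static int search_le(const uint64_t *keys, int nkeys, uint64_t target)
{
	int lo = 0, hi = nkeys - 1, ans = -1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (keys[mid] <= target) {
			ans = mid;	/* candidate; look for a larger one */
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return ans;
}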


* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-29 17:18 ` Btree directories (Re: Status of HFS+ support) Matthew Wilcox
  2000-08-29 18:45   ` Steve Lord
@ 2000-08-30  2:22   ` flar
  2000-08-30  8:35     ` Andi Kleen
  2000-08-30 14:25   ` Chris Mason
  2 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-08-30  2:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Halfmann Klaus, 'Matthew Wilcox', Roman Zippel,
	linuxppc-dev, linux-fsdevel


Matthew Wilcox wrote:
> You can lock while you're in a call, but eventually, you fill up the
> user's buffer and return.  At that point, you have to drop the lock
> because they might never call you again.  So you have to consider
> the case of a file being added or removed between calls to getdents.
> Current HFS doesn't even pretend to try.  It just stores an index (0
> .. n-1) and you pick up from there.  So sometimes you get files twice,
> sometimes files don't show up at all.
>
> I suspect you need to store the key of the item you just found, and
> then continue filling in the buffer from there on subsequent calls.
> The trouble is that there frequently isn't enough space available to
> do that.  The proposed ext2 btree extensions needed 64 bits of space and
> there was only 32 bits available.  What size keys does HFS+ have?

Catalog keys in HFS+ can be anywhere from 8 bytes all the way up to
518 bytes. I actually do save the key and do a search in the btree
in my HFS+ module. (It doesn't seem to work however. And yes, I'm
helping out Klaus with his code as well as working on a kernel
module with completely separate code.)

> Hmmm.. now the LFS patches have gone in and f_pos is now a 64-bit quantity,
> this sounds more plausible.  I'd be curious to hear from the ReiserFS
> people how they solved this problem.

The support for 64 bit filesystems is important for HFS+ as well, since
it uses 64 bit file sizes for all files.

> > I'd really like to know how the concurrency of the kernel filesystem
> > should work, can anybody feed me with documentations about that ?
>
> I've cc'd linux-fsdevel, where people can fill you in.

I've been ignoring this issue, but I certainly would love pointers to
more documentation.

	Brad Boyer
	flar@pants.nu



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30  2:22   ` flar
@ 2000-08-30  8:35     ` Andi Kleen
  0 siblings, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2000-08-30  8:35 UTC (permalink / raw)
  To: flar
  Cc: Matthew Wilcox, Halfmann Klaus, Roman Zippel, linuxppc-dev,
	linux-fsdevel


On Tue, Aug 29, 2000 at 07:22:05PM -0700, flar@allandria.com wrote:
> Matthew Wilcox wrote:
> > You can lock while you're in a call, but eventually, you fill up the
> > user's buffer and return.  At that point, you have to drop the lock
> > because they might never call you again.  So you have to consider
> > the case of a file being added or removed between calls to getdents.
> > Current HFS doesn't even pretend to try.  It just stores an index (0
> > .. n-1) and you pick up from there.  So sometimes you get files twice,
> > sometimes files don't show up at all.
> >
> > I suspect you need to store the key of the item you just found, and
> > then continue filling in the buffer from there on subsequent calls.
> > The trouble is that there frequently isn't enough space available to
> > do that.  The proposed ext2 btree extensions needed 64 bits of space and
> > there was only 32 bits available.  What size keys does HFS+ have?
>
> Catalog keys in HFS+ can be anywhere from 8 bytes all the way up to
> 518 bytes. I actually do save the key and do a search in the btree
> in my HFS+ module. (It doesn't seem to work however. And yes, I'm
> helping out Klaus with his code as well as working on a kernel
> module with completely separate code.)

You're sharing a problem with JFS then (and reiserfs to a certain
extent).

In theory you could put as much state as you want into a private field
of the file structure of the directory. This just ignores two ugly things:
NFS and telldir. Both basically require you to generate stateless 32bit
cookies that safely walk your directory.

The only good solution I know of is a stable namespace that never shrinks unless
the last entry is removed (basically what ext2 directories do). Then you can
just have an offset into that namespace. The namespace could be a hash plus a
collision counter.
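
A rough sketch of such a cookie; the 24/8 bit split, the hash function and
the names are arbitrary choices for illustration, not taken from ext2 or any
other filesystem.

#include <stdint.h>

static uint32_t name_hash(const char *name)
{
	uint32_t h = 0;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h & 0x00ffffffu;			/* keep 24 bits of hash */
}

/* 32-bit stateless cookie: name hash in the top bits, collision counter
 * (the entry's position among same-hash names) in the low byte. */
static uint32_t make_cookie(const char *name, unsigned collision)
{
	return (name_hash(name) << 8) | (collision & 0xffu);
}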

[XFS uses some trick of assuming that buffer sizes are big enough for one
of their hash buckets, but it'll surely break in a lot of cases]

Without a stable namespace, probably your only option is to add dummy entries
into the btree holes instead of rebalancing the tree, and to remove all holes
when the last entry goes away.

-Andi



* Re: Btree directories (Re: Status of HFS+ support)
@ 2000-08-30 12:06 Steve Lord
  2000-08-30 15:07 ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Steve Lord @ 2000-08-30 12:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: flar, Matthew Wilcox, Halfmann Klaus, Roman Zippel, linuxppc-dev,
	linux-fsdevel


> In theory you could put as much state as you want into a private field
> of the file structure of the directory. This just ignores two ugly things:
> NFS and telldir. Both basically require you to generate stateless 32bit
> cookies that safely walk your directory.

Can't we at least pass the cookie generation and interpretation down to
the filesystem layer, since at least NFS V3 allows bigger handles?

> The only good solution I know is a stable namespace that never shrinks unless
> the last entry is removed (basically what ext2 directories do). Then you can
> just have an offset into that namespace. Namespace could be hash + collision
> counter.
>
> [XFS uses some trick of assuming that buffer sizes are big enough for one
> of their hash buckets, but it'll surely break in a lot of cases]

Not strictly true - the version 1 format of directories in XFS will have
problems with this: they return a 64-bit hash value, and it can survive
with half of it, unless a single hash bucket gets too big. This problem
also shows up if you NFS-mount an XFS filesystem from Irix onto Linux, as
most existing Irix filesystems use V1 directories.

For this reason version 2 directories were introduced; the directory offset
returned here is actually a real offset into an address space made up of
the leaf blocks of the directory btree. In this case we only get into
problems with directories bigger than 2 Tbytes... I cannot remember how
holes in the leaf blocks are handled, but it does some tricks there too.
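
In other words, something along these lines - a hedged sketch only, with a
4096-byte directory block size assumed for illustration and no claim that it
matches the real XFS layout:

#include <stdint.h>

#define DIR_BLKSIZE 4096ULL	/* illustrative, not the real XFS value */

/* Treat the concatenation of the directory's leaf blocks as one flat
 * address space and hand out a plain byte offset into it. */
static uint64_t dir_offset(uint64_t leaf_block, uint32_t off_in_block)
{
	return leaf_block * DIR_BLKSIZE + off_in_block;
}

static void dir_decode(uint64_t pos, uint64_t *leaf_block, uint32_t *off)
{
	*leaf_block = pos / DIR_BLKSIZE;
	*off = (uint32_t)(pos % DIR_BLKSIZE);
}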

Steve

> Without stable name space probably your only option is to add dummy entries
> into the btree holes instead of rebalancing the tree, and removing all holes
> when the last entry goes away.
>
> -Andi


* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-29 17:18 ` Btree directories (Re: Status of HFS+ support) Matthew Wilcox
  2000-08-29 18:45   ` Steve Lord
  2000-08-30  2:22   ` flar
@ 2000-08-30 14:25   ` Chris Mason
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Mason @ 2000-08-30 14:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Halfmann, Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


On 8/29/00, 1:18:51 PM, Matthew Wilcox <matthew@wil.cx> wrote regarding
Btree directories (Re: Status of HFS+ support):

[ 32 bit directory offsets ]

> Hmmm.. now the LFS patches have gone in and f_pos is now a 64-bit quantity,
> this sounds more plausible.  I'd be curious to hear from the ReiserFS
> people how they solved this problem.

In reiserfs, the offset in the directory is a hash of the file name.  The
directories are sparse, so we could have a directory with 4 items at
offsets 1, 2, 32458, and 2 million.  The hash of a given item never changes;
new items can be inserted or removed on either side of any existing
item (except . and ..).

-chris



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 12:06 Btree directories (Re: Status of HFS+ support) Steve Lord
@ 2000-08-30 15:07 ` Andi Kleen
  2000-08-30 16:07   ` flar
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2000-08-30 15:07 UTC (permalink / raw)
  To: Steve Lord
  Cc: Andi Kleen, flar, Matthew Wilcox, Halfmann Klaus, Roman Zippel,
	linuxppc-dev, linux-fsdevel


On Wed, Aug 30, 2000 at 07:06:36AM -0500, Steve Lord wrote:
> >
> > In theory you could put as much state as you want into a private field
> > of the file structure of the directory. This just ignores two ugly things:
> > NFS and telldir. Both basically require you to generate stateless 32bit
> > cookies that safely walk your directory.
>
> Can't we at least pass the cookie generation and interpretation down to
> the filesystem layer, since at least NFS V3 allows bigger handles?

Only 64-bit though, which may or may not be enough. If HFS indexes directly
with names instead of hash values it'll probably not be enough, so they
would need to fall back to the no-shrink hack.
Anyway, you probably cannot ignore telldir() in 32-bit programs.


-Andi


* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 15:07 ` Andi Kleen
@ 2000-08-30 16:07   ` flar
  2000-08-30 20:49     ` David A. Gatwood
  0 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-08-30 16:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steve Lord, Andi Kleen, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


Andi Kleen wrote:
> Only 64bit though, which may or may not be enough. If HFS indexes directly
> with names instead of hash values it'll probably not be enough so they
> would need to fall back to the no shrink hack.
> Anyways, you probably cannot ignore telldir of 32bit programs.

Yes, that's the main problem with HFS/HFS+. All catalog entries are keyed
on a combination of parent directory ID and filename, because this is the
data MacOS asks for when a file is opened. These entries have to be sorted
inside the tree, and the sort rules are a little messy, since the string
compares are supposed to be case-insensitive. If you are opening a file,
or doing a lookup() it's very convenient. If you are doing readdir() it's
a massive headache. Basically you have to find the first entry with a
parent ID that matches the directory being read, and then read every
entry in order until you find one with a different parent. Because of
the sorting issue, it's not just a problem with shrinking the directory,
because growing the directory could cause the same problem. In fact,
growing or shrinking ANY directory on the filesystem could theoretically
change the entire tree by forcing the tree to add a new layer of index
nodes and rebalancing the tree. It gets very messy. There's a reason the
HFS filesystem code in the kernel has always had a habit of destroying
disks any time it has to do anything too complex.
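
A simplified sketch of that key ordering; the struct layout and name handling
are illustrative only - real HFS+ keys hold up to 255 UTF-16 units and the
case-insensitive compare uses Apple's case-folding tables from TN1150, not
tolower().

#include <stdint.h>
#include <ctype.h>

struct cat_key {
	uint32_t parent_cnid;	/* CNID of the containing directory */
	char name[32];		/* real keys: up to 255 UTF-16 units */
};

/* Order by parent directory first, then case-insensitively by name. */
static int cat_key_cmp(const struct cat_key *a, const struct cat_key *b)
{
	const char *p, *q;

	if (a->parent_cnid != b->parent_cnid)
		return a->parent_cnid < b->parent_cnid ? -1 : 1;
	for (p = a->name, q = b->name; *p && *q; p++, q++) {
		int ca = tolower((unsigned char)*p);
		int cb = tolower((unsigned char)*q);

		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	return (unsigned char)*p - (unsigned char)*q;
}

With that ordering, readdir() amounts to seeking to the smallest key whose
parent_cnid matches the directory being read and walking forward until the
parent changes.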

	Brad Boyer
	flar@pants.nu



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 20:49     ` David A. Gatwood
@ 2000-08-30 20:45       ` Steve Lord
  2000-08-30 21:05       ` Alexander Viro
  2000-08-31  5:49       ` flar
  2 siblings, 0 replies; 28+ messages in thread
From: Steve Lord @ 2000-08-30 20:45 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


> In my mind, this really screams the need for a pure vnode API in addition
> to keeping the current API for more traditional filesystems.  You could
> then keep track of these complex keys with either a table or hashing
> method into slots/buckets with 32 bit identifiers, then provide these to
> the vnode layer.  Sounds much easier than trying to make this look like a
> more traditional fs.
>
> Maybe someone could create a filesystem of type vnodefs that simply
> provides, through the current VFS, a pure vnode API that could be used by
> sub-filesystems (which would then absolutely be required to definitively
> probe for fs type automatically...).  Thoughts?
>

It is not that clean at the moment, but we already have a vnode-based system
hiding inside XFS: the Linux inode has a vnode in the fs-specific portion,
and the Linux inode ops call the VOPs. We still have a work item to go back and
clean up this layer - it knows too much about XFS right now.

Not really sure if it maps onto what you are suggesting, but it is out there.

Steve



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 16:07   ` flar
@ 2000-08-30 20:49     ` David A. Gatwood
  2000-08-30 20:45       ` Steve Lord
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David A. Gatwood @ 2000-08-30 20:49 UTC (permalink / raw)
  To: flar
  Cc: Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000 flar@allandria.com wrote:

> Yes, that's the main problem with HFS/HFS+. All catalog entries are keyed
> on a combination of parent directory ID and filename, because this is the
> data MacOS asks for when a file is opened. These entries have to be sorted
> inside the tree, and the sort rules are a little messy, since the string
> compares are supposed to be case-insensitive. If you are opening a file,
> or doing a lookup() it's very convenient. If you are doing readdir() it's
> a massive headache. Basically you have to find the first entry with a
> parent ID that matches the directory being read, and then read every
> entry in order until you find one with a different parent. Because of
> the sorting issue, it's not just a problem with shrinking the directory,
> because growing the directory could cause the same problem. In fact,
> growing or shrinking ANY directory on the filesystem could theoretically
> change the entire tree by forcing the tree to add a new layer of index
> nodes and rebalancing the tree. It gets very messy. There's a reason the
> HFS filesystem code in the kernel has always had a habit of destroying
> disks any time it has to do anything too complex.

In my mind, this really screams the need for a pure vnode API in addition
to keeping the current API for more traditional filesystems.  You could
then keep track of these complex keys with either a table or hashing
method into slots/buckets with 32 bit identifiers, then provide these to
the vnode layer.  Sounds much easier than trying to make this look like a
more traditional fs.

Maybe someone could create a filesystem of type vnodefs that simply
provides, through the current VFS, a pure vnode API that could be used by
sub-filesystems (which would then absolutely be required to definitively
probe for fs type automatically...).  Thoughts?

Also, let me see if I understand this... HFS+ basically needs a
hierarchical lock scheme with three states (shared, exclusive, unlocked) at
each level, right?  And I'm assuming that the tree balancing is for speed
reasons, and that an unbalanced tree doesn't rise to the level of
"corruption", right?


Later,
David

---------------------------------------------------------------------
A brief Haiku:

Microsoft is bad.
It seems secure at first glance.
Then you read your mail.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 20:49     ` David A. Gatwood
  2000-08-30 20:45       ` Steve Lord
@ 2000-08-30 21:05       ` Alexander Viro
  2000-08-30 21:51         ` David A. Gatwood
  2000-08-31  5:49       ` flar
  2 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-08-30 21:05 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, David A. Gatwood wrote:

> In my mind, this really screams the need for a pure vnode API in addition
> to keeping the current API for more traditional filesystems.  You could
> then keep track of these complex keys with either a table or hashing
> method into slots/buckets with 32 bit identifiers, then provide these to
> the vnode layer.  Sounds much easier than trying to make this look like a
> more traditional fs.

??? The current API _is_ a vnode one, with the namespace made visible to the
VFS and some race protection done there, along with generic checks like "thou
shalt not remove another user's file from another user's sticky directory even
if you have write permissions".

WTF is "traditional" filesystem? Inumber-based one? No such thing at the
VFS level.

VFS does not rely on inode numbers and it does not pass them around. Even
in ->lookup(). It passes the request "here's dentry with name foo and
parent bar, see if such file exists and say d_instantiate()". All other
requests are done in terms of such "filled" dentries. Period. VFS doesn't
care for inodes. You can have all inodes on your filesystem with random
->i_ino and/or constant ->i_ino with whatever internal search scheme you
like. They don't have to be anywhere near icache. Moreover, quite a few
filesystems work that way. VFS simply doesn't give a fuck.

Guys, WTF? Could you please read the (slightly outdated, whatever)
description of the current API _before_ discussing what should be changed?

That's bloody ridiculous - I challenge anyone who did fs work on 2.2, let
alone 2.4, to stand up and say how much the thing has in common with 2.0.

"I don't know what's there, but it got to be replaced with $FOO, because
I think that currently it's $BAR or something" is just plain silly,
_especially_ when $FOO is pretty much what we have right now, with several
stupid things fixed and _NONE_ of them relevant to problems in question.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 21:51         ` David A. Gatwood
@ 2000-08-30 21:49           ` Alexander Viro
  2000-08-30 22:13             ` David A. Gatwood
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-08-30 21:49 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, David A. Gatwood wrote:

> Also, race protection in the VFS layer doesn't make sense either, because
> how can the VFS possibly assume anything about the structure to know what
> it can and can't allow to happen concurrently?  And the sticky directory
> thing... that's the filesystem's responsibility to implement that, not the
> VFS layer.

Oh, really? How about rename()/rename() ones? I mean, those that were
_totally_ fucked up in 4.4BSD all over the friggin' place. How about
rename()/rmdir()? (ditto) Why should _that_ be duplicated into every fs,
with its own set of bugs in each of them?

> The VFS layer shouldn't even care if the underlying filesystem supports
> ownership, much less sticky directories.  It should simply pass
> information about the operation and who wants to do it to the filesystem
> and the filesystem should decide if that's a legal operation or not.
> Anything more depends on an inode-structured filesystem, or at least
> faking an inode structure for the VFS layer, which is a waste of system
> overhead with little or no obvious benefit.

Again, BS. You have ->permission() for that.

> > VFS does not rely on inode numbers and it does not pass them around. Even
>
> But it does, at least in 2.2.14.  I'm looking at the code right now.  The
> word inode appears 40 times in the dcache.c file alone.  The inode
> number is stored in the dentry as dentry->d_inode, and is used for several

Number? Look into dcache.h and fs.h. BTW, the fs-independent part of struct
inode in Linux _is_ a vnode - a normal, sane BSD/SunOS vnode.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 21:05       ` Alexander Viro
@ 2000-08-30 21:51         ` David A. Gatwood
  2000-08-30 21:49           ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: David A. Gatwood @ 2000-08-30 21:51 UTC (permalink / raw)
  To: Alexander Viro
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, Alexander Viro wrote:

> ??? Current API _is_ vnode one with namespace made visible to VFS and some
> race-protection done there, along with generic checks like "thou shalt not
> remove other's file from other's sticky directory even if you have write
> permissions".

I'm looking at it, and it doesn't look very vnode-like to me.  I see a buttload
of inode-oriented code all through the generic fs layer.  I'm not sure if
it's that way in 2.4, but I'm looking at 2.2.14 right now, and it's not as
simple as it could be, by any means.  You still have to have something
that resembles an inode.  For some filesystems, that makes no sense.

Also, race protection in the VFS layer doesn't make sense either, because
how can the VFS possibly assume anything about the structure to know what
it can and can't allow to happen concurrently?  And the sticky directory
thing... it's the filesystem's responsibility to implement that, not the
VFS layer's.

The VFS layer shouldn't even care if the underlying filesystem supports
ownership, much less sticky directories.  It should simply pass
information about the operation and who wants to do it to the filesystem
and the filesystem should decide if that's a legal operation or not.
Anything more depends on an inode-structured filesystem, or at least
faking an inode structure for the VFS layer, which is a waste of system
overhead with little or no obvious benefit.


> WTF is "traditional" filesystem? Inumber-based one? No such thing at the
> VFS level.

Traditional in the unix sense, as in being designed based on inodes, a
superblock, etc. as opposed to, e.g. NFS based on handles, or HFS based
on... I have no idea what....  :-)


> VFS does not rely on inode numbers and it does not pass them around. Even

But it does, at least in 2.2.14.  I'm looking at the code right now.  The
word inode appears 40 times in the dcache.c file alone.  The inode
number is stored in the dentry as dentry->d_inode, and is used for several
checks in the dentry code as to whether it can, e.g., remove a dentry,
etc.  I don't see how you can say it doesn't rely on inodes.  If it
touches anything that even resembles an inode, it's not pure vnodes,
period.  To make that sort of system work with non-inode-based filesystems
is a hack and will always be a hack.


David

---------------------------------------------------------------------
A brief Haiku:

Microsoft is bad.
It seems secure at first glance.
Then you read your mail.


* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 21:49           ` Alexander Viro
@ 2000-08-30 22:13             ` David A. Gatwood
  2000-08-30 22:17               ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: David A. Gatwood @ 2000-08-30 22:13 UTC (permalink / raw)
  To: Alexander Viro
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, Alexander Viro wrote:

> On Wed, 30 Aug 2000, David A. Gatwood wrote:
>
> > Also, race protection in the VFS layer doesn't make sense either, because
> > how can the VFS possibly assume anything about the structure to know what
> > it can and can't allow to happen concurrently?  And the sticky directory
> > thing... that's the filesystem's responsibility to implement that, not the
> > VFS layer.
>
> Oh, really? How about rename()/rename() ones? I mean, those that were
> _totally_ fucked up in 4.4BSD all over the friggin' place. How about
> rename()/rmdir()? (ditto) Why should _that_ be duplicated into every fs,
> with its own set of bugs in each of them?

How about a database filesystem that presents different views to different
clients?  User A may rename the "file" and it may only affect user A's
view.  User B could then rename that same file to a different filename.
It's simply not reasonable to assume global (system-wide) Unix semantics
for a filesystem, even for obvious functions like rename or rmdir, if you
want the fs API to be expandable to future filesystems that might come
along.


David

---------------------------------------------------------------------
A brief Haiku:

Microsoft is bad.
It seems secure at first glance.
Then you read your mail.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 22:13             ` David A. Gatwood
@ 2000-08-30 22:17               ` Alexander Viro
  2000-08-30 23:11                 ` David A. Gatwood
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-08-30 22:17 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, David A. Gatwood wrote:

> How about a database filesystem that presents different view to different
> clients.  User A may rename the "file" and it may only affect user A's
> view.  User B could then rename that same file to a different filename.
> It's simply not reasonable to assume global (system-wide) unix semantics
> for a filesystem, even for obvious functions like rename or rmdir if you
> want the fs API to be expandable to future filesystems that might come
> along.

So what? You have different views, you have separate dentry trees. More
power to you... the dentry tree represents the visible namespace, no matter
what stands behind it.

BTW, the race in question looks like this: take two long branches and do a
simultaneous rename() of the middle of the first branch to the top of the
second, and the symmetric one. Notice that each of them is OK on its own,
but together they would create a loop, detached from the fs. Completely
independent of the filesystem type. And completely fscked up in *BSD. Yes,
it chews filesystems.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 22:17               ` Alexander Viro
@ 2000-08-30 23:11                 ` David A. Gatwood
  2000-08-31  0:10                   ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: David A. Gatwood @ 2000-08-30 23:11 UTC (permalink / raw)
  To: Alexander Viro
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, Alexander Viro wrote:

> BTW, the race in question looks so: get two long branches and do
> simultaneous rename() of the middle of first branch to the top of the
> second and symmetric one. Notice that each of them is OK, but together
> they would create a loop, detached from fs. Completely independent from
> the filesystem type. And completely fscked up in *BSD. Yes, chews
> filesystems.

But it's not a race that the VFS should care about, if the filesystem does
its job properly.  The first one should take an exclusive lock on the
inode of the directory and its old and new parents (simultaneously or in
an ordered fashion so as to prevent deadlock), since it's writing to them.
When a second thread tries to move something higher in the tree, it must
make sure it is not moving a parent into one of its descendants.  Assuming
a hierarchical locking scheme, it will be unable to obtain the appropriate
locks to verify this, so it will sit there until the previous operation
completes.

Once the first move operation completes and releases its exclusive lock, the
second operation will get its shared lock on the (now grandparent) inode
and see that it is about to do an illegal link loop and fail.  This should
all occur in the filesystem, transparent to the VFS, which should simply
be told that the directory to be moved no longer exists in that location
or that it is illegal to move a parent into its child.

As far as the VFS is concerned, a filesystem operation should look like an
atomic operation that either succeeds or fails.  It shouldn't care about
the address hierarchy at all.  Concurrency should be protected in the
filesystem, because that's the only place that really understands when
locks need to be used.

The VFS layer shouldn't have to check if an operation is legal before it
executes, because even if it were legal, that legality might change
between the check and the operation unless you lock the whole filesystem
in-between -- horrible for concurrency.  You only want to lock things
during the inode modifications, which could be much less than the time
needed for the entire operation.

And besides... in some filesystems, weird loops may be okay.  ;-)  No, I
can't contrive an example, but you can bet somebody's tried....  :-)


David

---------------------------------------------------------------------
A brief Haiku:

Microsoft is bad.
It seems secure at first glance.
Then you read your mail.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 23:11                 ` David A. Gatwood
@ 2000-08-31  0:10                   ` Alexander Viro
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Viro @ 2000-08-31  0:10 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: flar, Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000, David A. Gatwood wrote:

> But it's not a race that the VFS should care about, if the filesystem does
> its job properly.  The first one should take an exclusive lock on the
> inode of the directory and its old and new parents (simultaneously or in
> an ordered fashion so as to prevent deadlock), since it's writing to them.
> When a second thread tries to move something higher in the tree, it must
> make sure it is not moving a parent into one of its descendants.  Assuming
> a hierarchical locking scheme, it will be unable to obtain the appropriate
> locks to verify this, so it will sit there until the previous operation
> completes.

Eww... So that check has to read all directories on the path from the target
to the root? You are kidding. That, BTW, is one more place where having a
consistent view of the namespace is a big win.

> Once the first move operation completes and releases its exclusive lock, the
> second operation will get its shared lock on the (now grandparent) inode
> and see that it is about to do an illegal link loop and fail.  This should
> all occur in the filesystem, transparent to the VFS, which should simply
> be told that the directory to be moved no longer exists in that location
> or that it is illegal to move a parent into its child.

Several problems with that:
	* I don't think that a lot of fs writers are smarter than Kirk. He
missed that one.
	* Duplicating the generic code is evil. It _will_ be screwed, one
way or another. Heck, s/or/and/, actually ;-/
	* More complex code in filesystems.

4.4BSD is, as much as I love that kernel, badly screwed in VFS-related
parts (which, incidentally, was the reason why I'm dealing with Linux and
not *BSD, BSD bigot as I am).

	I have yet to see clear benefits of the vnode API as in 4.4 over the
vnodes-under-dcache approach as in Linux. Notice that filesystems do not care
much - in the directory part they are simply relieved from doing tons of
braindead checks that are duplicated all over the place in 4.4. If they
want to do extra checks - more power to them. Most of them don't,
though.

> As far as the VFS is concerned, a filesystem operation should look like an
> atomic operation that either succeeds or fails.  It shouldn't care about
> the address hierarchy at all.  Concurrency should be protected in the
> filesystem, because that's the only place that really understands when
> locks need to be used.
>
> The VFS layer shouldn't have to check if an operation is legal before it
> executes, because even if it were legal, that legality might change
> between the check and the operation unless you lock the whole filesystem
> in-between -- horrible for concurrency.  You only want to lock things
> during the inode modifications, which could be much less than the time
> needed for the entire operation.

Not really. First of all, we have enough state not to care about the
legality changes (in the cases we do check in the VFS, that is). Internal data
structures are entirely the responsibility of filesystems, as are
additional checks, etc. As for locking things only when the inode
changes... fine, indeed, but that's _not_ the vnode model - check the flags
for namei(9), especially LOCK_PARENT. The bottom line: there are things
that can easily be done in the generic layer, transparently to filesystems.
It turns out that maintaining a consistent namespace view is not only
possible, but quite simple. It frees the filesystem from a lot of cruft that
is very easy to get wrong, and it makes life actually much easier. There's
a lot of subtle crap that simply doesn't happen to a fs on Linux (2.2/2.4),
and fixing it in every fs is _pain_.

> And besides... in some filesystems, weird loops may be okay.  ;-)  No, I

Detached ones? ;-)

Oh, well... Wait until I'm done with the documentation, will you? Yes, a
large piece on translation from/to BSD terms will be there. And yes, the
dcache+vnode design makes sense and is not more restrictive than a pure
vnode one - it actually addresses some of the problems of the latter.



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-30 20:49     ` David A. Gatwood
  2000-08-30 20:45       ` Steve Lord
  2000-08-30 21:05       ` Alexander Viro
@ 2000-08-31  5:49       ` flar
  2000-08-31 10:33         ` Alexander Viro
  2 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-08-31  5:49 UTC (permalink / raw)
  To: David A. Gatwood
  Cc: Andi Kleen, Steve Lord, Matthew Wilcox, Halfmann Klaus,
	Roman Zippel, linuxppc-dev, linux-fsdevel


David A. Gatwood wrote:
> On Wed, 30 Aug 2000 flar@allandria.com wrote:
> > Yes, that's the main problem with HFS/HFS+. All catalog entries are keyed
> > on a combination of parent directory ID and filename, because this is the
> > data MacOS asks for when a file is opened. These entries have to be sorted
> > inside the tree, and the sort rules are a little messy, since the string
> > compares are supposed to be case-insensitive. If you are opening a file,
> > or doing a lookup() it's very convenient. If you are doing readdir() it's
> > a massive headache. Basically you have to find the first entry with a
> > parent ID that matches the directory being read, and then read every
> > entry in order until you find one with a different parent. Because of
> > the sorting issue, it's not just a problem with shrinking the directory,
> > because growing the directory could cause the same problem. In fact,
> > growing or shrinking ANY directory on the filesystem could theoretically
> > change the entire tree by forcing the tree to add a new layer of index
> > nodes and rebalancing the tree. It gets very messy. There's a reason the
> > HFS filesystem code in the kernel has always had a habit of destroying
> > disks any time it has to do anything too complex.
>
> In my mind, this really screams the need for a pure vnode API in addition
> to keeping the current API for more traditional filesystems.  You could
> then keep track of these complex keys with either a table or hashing
> method into slots/buckets with 32 bit identifiers, then provide these to
> the vnode layer.  Sounds much easier than trying to make this look like a
> more traditional fs.

Actually, one of the main points of HFS+ is that it's more like a traditional
UNIX filesystem than HFS was.

> Maybe someone could create a filesystem of type vnodefs that simply
> provides, through the current VFS, a pure vnode API that could be used by
> sub-filesystems (which would then absolutely be required to definitively
> probe for fs type automatically...).  Thoughts?

Doesn't sound any cleaner, and it certainly sounds less efficient.

> Also, let me see if I understand this... HFS+ basically needs a
> hierarchical lock scheme with three states (shared, exclusive, unlocked) at
> each level, right?  And I'm assuming that the tree balancing is for speed
> reasons, and that an unbalanced tree doesn't rise to the level of
> "corruption", right?

Well, I'll try to explain. The tree doesn't have to be perfectly balanced,
but there are some rules to follow. The first rule is that all keys have
to be sorted at all times, although there can be empty space in the catalog
between two keys for adding in new ones. The other thing is that all leaf
nodes in the tree have to have the same height. This causes some interesting
problems when you fill up the currently allocated space, since you have to
restructure the tree pretty considerably in order to create more possible
entries.

Deleting in many cases is cleaner and simpler than adding. In most
cases, when you delete an entry, you only have to shuffle around a
few entries in order to make the tree consistent. Basically, the disk
space for the tree is divided into nodes, and each node can have as
many entries as can fit, but it can only have a single contiguous block
of free space. When the free space is all allocated in a node, you can
rebalance the tree using existing nodes if the tree has become heavily
unbalanced. However, if there is very little free space in the tree,
you have to basically rebuild the thing from scratch with an extra
layer of indexing which makes the tree deeper and thus slower to
search. When a new filesystem is created, it only has the root of
the tree, which is also the only leaf node. When that fills up, you
create a new root which is an index node and split the single leaf
node into multiple leaf nodes, each having a key in the index
node. This continues as long as you need more room for entries in
the tree. It's hard to summarize, but the documentation from Apple
has some great diagrams.

http://developer.apple.com/technotes/tn/tn1150.html
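
For readers following along, the per-node bookkeeping being described hangs
off a small descriptor at the start of each node. This sketch reflects my
reading of TN1150 (all fields are big-endian on disk) and should be checked
against the technote before being relied on.

#include <stdint.h>

struct bt_node_descriptor {
	uint32_t fLink;		/* next node at this level, 0 if none */
	uint32_t bLink;		/* previous node at this level, 0 if none */
	int8_t   kind;		/* leaf (-1), index (0), header (1), map (2) */
	uint8_t  height;	/* 1 for leaf nodes, increasing toward the root */
	uint16_t numRecords;	/* records stored in this node */
	uint16_t reserved;
} __attribute__((packed));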

Most things in HFS+ can be mapped easily into VFS, and readdir() is
the only thing that is giving me any real problems. From what I could
tell, readdir() is somewhat messy for every filesystem anyway. I've
been doing my coding on 2.2.x mostly, but I've looked at the 2.4.x
interfaces. I personally think it looks a lot cleaner than the 2.2.x
for many things, particularly for read/write of files. I will in the
near future try to find all the 2.4 documentation that wasn't finished
the last time I looked and see what I can figure out of the stuff that
wasn't obvious from the code.

I'm not really sure I answered your questions, but I hope this explains
some of the strange situations with HFS+ and HFS (which mostly fits the
above description as well).

	Brad Boyer
	flar@pants.nu



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-31  5:49       ` flar
@ 2000-08-31 10:33         ` Alexander Viro
  2000-08-31 17:48           ` flar
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-08-31 10:33 UTC (permalink / raw)
  To: flar
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


On Wed, 30 Aug 2000 flar@allandria.com wrote:

> > > Yes, that's the main problem with HFS/HFS+. All catalog entries are keyed
> > > on a combination of parent directory ID and filename, because this is the
> > > data MacOS asks for when a file is opened. These entries have to be sorted
> > > inside the tree, and the sort rules are a little messy, since the string
> > > compares are supposed to be case-insensitive. If you are opening a file,
> > > or doing a lookup() it's very convenient. If you are doing readdir() it's
> > > a massive headache. Basically you have to find the first entry with a
> > > parent ID that matches the directory being read, and then read every
> > > entry in order until you find one with a different parent. Because of

OK.

> > > the sorting issue, it's not just a problem with shrinking the directory,
> > > because growing the directory could cause the same problem. In fact,

Shrinking/growing the same directory is not going to happen while you
are in ->readdir(). You have ->i_zombie on it, so...

> > > growing or shrinking ANY directory on the filesystem could theoretically
> > > change the entire tree by forcing the tree to add a new layer of index
> > > nodes and rebalancing the tree. It gets very messy. There's a reason the

	Urgh. Could you show me your code? I think that a variation of the
scheme used for fatfs may work here. Basically, you can keep a hash of
directory entries and do (find; unhash; copy; rehash) whenever you move
one. Inodes would keep references to such entries, ditto for readdir
having a "cursor" entry. Rebalancing would go under an rw-semaphore (writer)
and all tree lookups/scans under the same semaphore (reader). That way you
can get rid of the tree search in ->write_inode(), BTW.
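
A kernel-style sketch of that locking pattern, with hypothetical hfsplus_*
names (nothing here is taken from an existing driver): lookups and scans take
the semaphore shared, rebalancing takes it exclusive.

#include <linux/rwsem.h>

static DECLARE_RWSEM(hfsplus_tree_sem);

static void hfsplus_tree_lookup_example(void)
{
	down_read(&hfsplus_tree_sem);	/* many concurrent readers allowed */
	/* ... walk the catalog tree, fill in dirents ... */
	up_read(&hfsplus_tree_sem);
}

static void hfsplus_tree_rebalance_example(void)
{
	down_write(&hfsplus_tree_sem);	/* excludes all lookups and scans */
	/* ... split nodes, move records, update any cached "cursor" entries ... */
	up_write(&hfsplus_tree_sem);
}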



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-31 10:33         ` Alexander Viro
@ 2000-08-31 17:48           ` flar
  2000-08-31 19:54             ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-08-31 17:48 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


Alexander Viro wrote:
> Shrinking/growing the same directory is not going to happen while you
> are in ->readdir(). You have ->i_zombie on it, so...

That's a bit of a help. The 2.2 docs rarely mentioned what was locked
when, and finding all the various locks in the code wasn't easy.

> 	Urgh. Could you show me your code? I think that a variation of the
> scheme used for fatfs may work here. Basically, you can keep hash of

You can find some of my code at http://devel.linuxppc.org/pub/users/flar/
which is in the form of patches to a slightly old 2.2 kernel. However,
since the patches hardly touch any existing files, it will likely apply
cleanly to any 2.2 kernel. I have some minor changes that aren't there
yet, but nothing that changes the basic outline or actually makes it
work properly. I didn't look much at fatfs because I didn't think it
would be relevant, but I'll take a look now that I have some idea what
I should look for there. My code is based on a mix of code stolen from
ext2 and hfs and a few ideas I thought of on my own.

> directory entries and do (find; unhash; copy; rehash) whenever you move
> one. inodes would keep references to such entries, ditto for readdir
> having a "cursor" entry. Rebalancing would go under rw-semaphore (writer)
> and all tree lookups/scans under the same semaphore (reader). That way you
> can get rid of the tree search in ->write_inode(), BTW.

I'm currently ignoring locking almost entirely, because my support is
read only, but I did have a few ideas on where I would need locking.
Hopefully my code isn't completely wrong, but I'd be happy to accept
any advice on where I really do need to worry about locking. I still
haven't read the latest documentation that I've been told got put
into 2.4 but I also haven't started porting my code to 2.4 for lack
of time.

	Brad Boyer
	flar@pants.nu



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-31 17:48           ` flar
@ 2000-08-31 19:54             ` Alexander Viro
  2000-08-31 22:09               ` flar
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-08-31 19:54 UTC (permalink / raw)
  To: flar
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


On Thu, 31 Aug 2000 flar@allandria.com wrote:

> Alexander Viro wrote:
> > Shrinking/growing the same directory is not going to happen while you
> > are in ->readdir(). You have ->i_zombie on it, so...
>
> That's a bit of a help. The 2.2 docs rarely mentioned what was locked
> when, and finding all the various locks in the code wasn't easy.

See Documentation/filesystems/Locking. You probably want the sections on
inode_operations and file_operations first. The thing is _not_ a tutorial
on methods - it's a reference on locking rules. IOW, it was written
so that one could easily keep it in sync with the code and check what any
given method can expect to be locked.

BTW, readdir() prototype looks fishy - we don't really need struct file*
there, but fixing that is a 2.5 matter.

> > 	Urgh. Could you show me your code? I think that a variation of the
> > scheme used for fatfs may work here. Basically, you can keep hash of

> You can find some of my code at http://devel.linuxppc.org/pub/users/flar/
> which is in the form of patches to a slightly old 2.2 kernel. However,
> since the patches hardly touch any existing files, it will likely apply
> cleanly to any 2.2 kernel. I have some minor changes that aren't there
> yet, but nothing that changes the basic outline or actually makes it
> work properly. I didn't look much at fatfs because I didn't think it
> would be relevant, but I'll take a look now that I have some idea what
> I should look for there. My code is based on a mix of code stolen from
> ext2 and hfs and a few ideas I thought of on my own.

Thanks, I'll look at the patch. In fatfs, keep in mind that you probably
don't need fake inumbers - if you have some permanent ID, that is.
Relevant stuff is in fs/fat/inode.c. Comments there are slightly obsolete
- I didn't update them in the last couple of versions before it went into
the tree.

> > directory entries and do (find; unhash; copy; rehash) whenever you move
> > one. inodes would keep references to such entries, ditto for readdir
> > having a "cursor" entry. Rebalancing would go under rw-semaphore (writer)
> > and all tree lookups/scans under the same semaphore (reader). That way you
> > can get rid of the tree search in ->write_inode(), BTW.
>
> I'm currently ignoring locking almost entirely, because my support is
> read only, but I did have a few ideas on where I would need locking.
> Hopefully my code isn't completely wrong, but I'd be happy to accept
> any advice on where I really do need to worry about locking. I still
> haven't read the latest documentation that I've been told got put
> into 2.4 but I also haven't started porting my code to 2.4 for lack
> of time.

Hopefully it will be easier than 2.2 - Ingo did fine work on regular
files and that killed a lot of complexity. HFS+ doesn't have sparse files,
right? If so, you'll only need a function that
	a) can take a block number less than or equal to the last block in the
file and will set ->b_blocknr (of the buffer_head passed to you) to the
disk block number.
	b) can add a block to the end of the file and return its number the
same way.
Oh, and you'll still need ->truncate() ;-/
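
Put together, (a) and (b) collapse into one get_block-style callback. The
sketch below uses hypothetical hfsplus_* helpers for the extent lookup and
allocation; only the obligations stated above (set ->b_blocknr for an existing
block, allocate at end-of-file when asked to create) are taken as given, and a
real 2.4 implementation would also mark the buffer mapped.

#include <linux/fs.h>
#include <linux/errno.h>

long hfsplus_extent_lookup(struct inode *inode, long iblock);	/* hypothetical */
long hfsplus_extend_file(struct inode *inode);			/* hypothetical */

static int hfsplus_get_block(struct inode *inode, long iblock,
			     struct buffer_head *bh_result, int create)
{
	long phys = hfsplus_extent_lookup(inode, iblock);

	if (phys < 0) {			/* block not allocated yet */
		if (!create)
			return -EIO;	/* no sparse files, so this is past EOF */
		phys = hfsplus_extend_file(inode);
		if (phys < 0)
			return (int)phys;
	}
	bh_result->b_blocknr = phys;	/* the disk block for this file block */
	return 0;
}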
	Everything else is done by library helper functions - see
the current fatfs, hfs or hpfs for details (address_space_operations
methods).
	Directories are essentially the same as they used to be in 2.2,
except that rmdir'ed busy directories are now handled in the VFS, so
you don't have to worry about them - if rmdir() or rename() over a directory
returns 0, the directory is marked dead and you can forget about anything
except the ->lookup() on it - every other method will be stopped.
	Symlinks are _way_ easier - if you provide a ->readpage() for them
you are done - just use page_readlink() and page_follow_link(). If you
have them in-core (as is the case with fast symlinks on ext2, autofs ones,
etc.), pass the contents to vfs_readlink() and vfs_follow_link() (the page
variant does exactly that after reading the page - it was just so common
that it deserved library functions of its own).
	Special files don't need any treatment at all; you simply call
init_special_inode() when you read or create an inode for such a beast (see
the examples in ext2_read_inode() and ext2_mknod(); any filesystem that
supports specials will do as an example - they all do the same thing).



* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-31 19:54             ` Alexander Viro
@ 2000-08-31 22:09               ` flar
  2000-09-01  2:40                 ` Alexander Viro
  0 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-08-31 22:09 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


Alexander Viro wrote:
> See Documentation/filesystems/Locking. You probably want the sections on
> inode_operations and file_operations first. The thing is _not_ a tutorial
> on methods - it's a reference on locking rules. IOW, it had been written
> so that one could easily keep it in sync with the code and check what any
> given method can expect to be locked.
>
> BTW, readdir() prototype looks fishy - we don't really need struct file*
> there, but fixing that is a 2.5 matter.

I'll take a look. Thanks.

> Thanks, I'll look at the patch. In fatfs, keep in mind that you probably
> don't need fake inumbers - if you have some permanent ID, that is.
> Relevant stuff is in fs/fat/inode.c. Comments there are slightly obsolete
> - I didn't update them in the last couple of versions before it went into
> the tree.

Yes, HFS and HFS+ both have a CNID for each file or directory. It just isn't
entirely straightforward to search the catalog by this value, since the key
is a combination of parentCNID and name. The CNID is guaranteed to be unique
across a filesystem and there are a handful that are reserved for system
files/directories such as the root (2) and the "parent of root" (1).
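
For reference, the key looks roughly like the sketch below - the field names
are mine, not the exact on-disk layout (the real key also carries an explicit
key length), and real HFS+ name ordering is case-insensitive Unicode
collation, not the plain code-point compare shown:

#include <linux/types.h>

struct hfsp_cat_key {
	__u32	parent_cnid;	/* CNID of the containing directory */
	__u16	name_len;	/* name length in UTF-16 units */
	__u16	name[255];	/* file/folder name */
};

/* Keys sort by parent CNID first, then by name, so "name in directory X"
 * is a straight btree search, but a lookup by CNID alone is not. */
static int hfsp_cat_key_cmp(const struct hfsp_cat_key *a,
			    const struct hfsp_cat_key *b)
{
	int i, len;

	if (a->parent_cnid != b->parent_cnid)
		return a->parent_cnid < b->parent_cnid ? -1 : 1;
	len = a->name_len < b->name_len ? a->name_len : b->name_len;
	for (i = 0; i < len; i++)
		if (a->name[i] != b->name[i])
			return a->name[i] < b->name[i] ? -1 : 1;
	return a->name_len - b->name_len;
}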

> Hopefully it will be easier than 2.2 - Ingo did fine work on regular
> files and that killed a lot of complexity. HFS+ doesn't have sparse files,
> right? If so - you'll only need a function that

HFS and HFS+ do not allow sparse files.

> 	a) can take a block number less than or equal to the last block in
> the file and will set ->b_blocknr (of the buffer_head passed to you) to the
> disk block number.

This is just a table lookup in the file extents record held in the catalog
entry record, or a search in the extents overflow tree if the file has
too many non-contiguous blocks to fit in the records held in the catalog.
Each extent contains a start block and a block count, and there is a fixed
amount of space set aside for extents inside the catalog entry. The extents
overflow tree is another btree like the catalog file, but is keyed on
CNID, fork, and file offset.
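
The in-catalog part of the lookup is just a short walk over that extent
array, something like the sketch below (names are made up; HFS keeps 3
extents in the catalog record, HFS+ keeps 8):

#include <linux/types.h>

struct hfsp_extent {
	__u32	start_block;	/* first allocation block of the extent */
	__u32	block_count;	/* number of allocation blocks */
};

static int hfsp_map_block(const struct hfsp_extent *ext, int nr_ext,
			  unsigned long fileblock, unsigned long *phys)
{
	unsigned long off = fileblock;
	int i;

	for (i = 0; i < nr_ext && ext[i].block_count; i++) {
		if (off < ext[i].block_count) {
			*phys = ext[i].start_block + off;
			return 0;
		}
		off -= ext[i].block_count;
	}
	/* Not covered by the in-catalog extents: the caller would go on to
	 * search the extents overflow tree (keyed on CNID, fork and offset). */
	return -1;
}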

> 	b) can add a block to the end of the file and return its number the
> same way.

This isn't much worse, but in the case of a massively fragmented filesystem
it could involve growing the extents overflow tree.

> Oh, and you'll still need ->truncate() ;-/

I'll keep that in mind.  :)

> 	Everything else is done by library helper functions - see
> the current fatfs, hfs or hpfs for details (address_space_operations
> methods).

I did notice the address space stuff in the short time I looked at
the 2.4 code. It looked like it really cleaned up the helper functions.

> 	Directories are essentially the same as they used to be in 2.2,
> except that rmdir'ed busy directories are now handled in the VFS, so
> you don't have to worry about them - if rmdir() or rename() over a directory
> returns 0, the directory is marked dead and you can forget about anything
> except the ->lookup() on it - every other method will be stopped.

What exactly is marked dead? The inode? I'm not really sure what this
saves in actual code. Any place in particular to compare between 2.2 and 2.4?

> 	Symlinks are _way_ easier - if you provide a ->readpage() for them
> you are done - just use page_readlink() and page_follow_link(). If you
> have them in-core (as is the case with fast symlinks on ext2, autofs ones,
> etc.) - pass the contents to vfs_readlink() and vfs_follow_link() (the page
> variant does exactly that after reading the page - it just was so common
> that it deserved library functions of its own).
> 	Special files don't need any treatment at all; you simply call
> init_special_inode() when you read or create an inode for such a beast (see
> the examples in ext2_read_inode() and ext2_mknod(); any filesystem that
> supports specials will do as an example - they all do the same thing).

Well, since MacOS 8/9 don't support special files, I'm waiting until I
can get my own copy of MacOS X before I support all the features like
special files and links.

	Brad Boyer
	flar@pants.nu


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Btree directories (Re: Status of HFS+ support)
  2000-08-31 22:09               ` flar
@ 2000-09-01  2:40                 ` Alexander Viro
  2000-09-01  3:52                   ` flar
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Viro @ 2000-09-01  2:40 UTC (permalink / raw)
  To: flar
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


On Thu, 31 Aug 2000 flar@allandria.com wrote:

> Yes, HFS and HFS+ both have a CNID for each file or directory. It just isn't
> entirely straightforward to search the catalog by this value, since the key
> is a combination of parentCNID and name. The CNID is guaranteed to be unique
> across a filesystem and there are a handful that are reserved for system
> files/directories such as the root (2) and the "parent of root" (1).

OK. Are they preserved by rename()?

> > 	Directories are essentially the same as they used to be in 2.2,
> > except that rmdir'ed busy directories are now handled in the VFS, so
> > you don't have to worry about them - if rmdir() or rename() over a directory
> > returns 0, the directory is marked dead and you can forget about anything
> > except the ->lookup() on it - every other method will be stopped.
>
> What exactly is marked dead? The inode? I'm not really sure what this
> saves in actual code. Any place in particular to compare between 2.2 and 2.4?

You don't have to check that the directory is not busy in rmdir()/rename().
Back in 2.2, filesystems either had to do that, or they had to detect
attempts to read from, create in, etc. dead directories. Not much of a
saving, but...


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Btree directories (Re: Status of HFS+ support)
  2000-09-01  2:40                 ` Alexander Viro
@ 2000-09-01  3:52                   ` flar
  2000-09-02  4:04                     ` Tony Mantler
  0 siblings, 1 reply; 28+ messages in thread
From: flar @ 2000-09-01  3:52 UTC (permalink / raw)
  To: Alexander Viro
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


Alexander Viro wrote:
> On Thu, 31 Aug 2000 flar@allandria.com wrote:
>
> > Yes, HFS and HFS+ both have a CNID for each file or directory. It just isn't
> > entirely straightforward to search the catalog by this value, since the key
> > is a combination of parentCNID and name. The CNID is guaranteed to be unique
> > across a filesystem and there are a handful that are reserved for system
> > files/directories such as the root (2) and the "parent of root" (1).
>
> OK. Are they preserved by rename()?

There isn't anything preventing it. A rename() is messy enough because of the
headache of moving around entries in the btree and updating all the implicit
usage of the catalog key which depends on the filename. I don't think the
MacOS actually preserves the CNID on a rename, tho. It's something that I
think would be about the same complexity either way. The only thing you
save by keeping the same CNID is changes to the extents overflow tree,
which is usually empty anyway. (My code doesn't do rename() yet...)

> You don't have to check that the directory is not busy in rmdir()/rename().
> Back in 2.2, filesystems either had to do that, or they had to detect
> attempts to read from, create in, etc. dead directories. Not much of a
> saving, but...

Well, I don't think every filesystem did it properly in 2.2, then. I'm
not entirely sure, but I think HFS wasn't quite as careful as it should
have been about such things. As I recall, the HFS code has wandered around
between different maintainers and is a little messy as a result.

I did finally get a recent source tree for 2.4, and I have to say that the
documentation looks a lot better these days.

	Brad Boyer
	flar@pants.nu


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Btree directories (Re: Status of HFS+ support)
@ 2000-09-01 17:05 Halfmann, Klaus
  2000-09-01 17:40 ` flar
  0 siblings, 1 reply; 28+ messages in thread
From: Halfmann, Klaus @ 2000-09-01 17:05 UTC (permalink / raw)
  To: 'flar@allandria.com ', 'viro@math.psu.edu '
  Cc: 'dgatwood@deepspace.mklinux.org ', 'ak@suse.de ',
	'lord@sgi.com ', 'matthew@wil.cx ',
	Halfmann, Klaus, 'roman@augan.com ',
	'linuxppc-dev@lists.linuxppc.org ',
	'linux-fsdevel@vger.kernel.org '


...
> OK. Are they [the CNIDs] preserved by rename()?

> There isn't anything preventing it. A rename() is messy enough
> because of the headache of moving around entries in the btree
> and updating all the implicit usage of the catalog key which
> depends on the filename. I don't think the
> MacOS actually preserves the CNID on a rename, tho. It's something that
> I think would be about the same complexity either way. The only thing you
> save by keeping the same CNID is changes to the extents overflow tree,
> which is usually empty anyway. (My code doesn't do rename() yet...)

Brad, I'm sure you are wrong on this one. Try the following in MacOS:
create an alias to a folder and then rename it; the alias will still work.
So I'm sure the CNIDs will not change.

Klaus

** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Btree directories (Re: Status of HFS+ support)
  2000-09-01 17:05 Halfmann, Klaus
@ 2000-09-01 17:40 ` flar
  0 siblings, 0 replies; 28+ messages in thread
From: flar @ 2000-09-01 17:40 UTC (permalink / raw)
  To: Halfmann, Klaus
  Cc: 'viro@math.psu.edu ',
	'dgatwood@deepspace.mklinux.org ', 'ak@suse.de ',
	'lord@sgi.com ', 'matthew@wil.cx ',
	Halfmann Klaus, 'roman@augan.com ',
	'linuxppc-dev@lists.linuxppc.org ',
	'linux-fsdevel@vger.kernel.org '


Halfmann, Klaus wrote:
> Brad, I'm sure you are wrong on this one. Try the following in MacOS:
> create an alias to a folder and then rename it; the alias will still work.
> So I'm sure the CNIDs will not change.

Alias files in the MacOS aren't that simple. The MacOS stores quite a bit
of info in an alias record, and tries to find the file even if a bunch of
stuff has changed. Take a look at

http://developer.apple.com/techpubs/mac/Files/Files-340.html

Note that it mentions that even files restored from backups can be easily
found by alias.

I went searching, and found a description of the functions FSpExchangeFiles
and PBExchangeFiles in

http://developer.apple.com/technotes/tn/tn1041.html

which specifically states that when files are moved, it is effectively the
CNID that changes, because the disk block allocation data is what really
gets moved around, not the filename. This means they don't have to make any
major changes to a catalog record; all they have to do is change a few
particular fields in the entry. (Note this info is for HFS...)

	Brad Boyer
	flar@pants.nu


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Btree directories (Re: Status of HFS+ support)
  2000-09-01  3:52                   ` flar
@ 2000-09-02  4:04                     ` Tony Mantler
  0 siblings, 0 replies; 28+ messages in thread
From: Tony Mantler @ 2000-09-02  4:04 UTC (permalink / raw)
  To: flar, Alexander Viro
  Cc: David A. Gatwood, Andi Kleen, Steve Lord, Matthew Wilcox,
	Halfmann Klaus, Roman Zippel, linuxppc-dev, linux-fsdevel


At 10:52 PM -0500 8/31/2000, flar@allandria.com wrote:
>Alexander Viro wrote:
>> On Thu, 31 Aug 2000 flar@allandria.com wrote:
>>
>> > Yes, HFS and HFS+ both have a CNID for each file or directory. It just
>> > isn't entirely straightforward to search the catalog by this value,
>> > since the key is a combination of parentCNID and name. The CNID is
>> > guaranteed to be unique across a filesystem and there are a handful
>> > that are reserved for system files/directories such as the root (2)
>> > and the "parent of root" (1).
>>
>> OK. Are they preserved by rename()?
>
>There isn't anything preventing it. A rename() is messy enough because of the
>headache of moving around entries in the btree and updating all the implicit
>usage of the catalog key which depends on the filename. I don't think the
>MacOS actually preserves the CNID on a rename, tho. It's something that I
>think would be about the same complexity either way. The only thing you
>save by keeping the same CNID is changes to the extents overflow tree,
>which is usually empty anyway. (My code doesn't do rename() yet...)
[...]

I think the point of the CNIDs is that they are preserved on rename
(at least, MacOS preserves them), so you can do things like open a file, do
some stuff, close it for a bit, then find it again by its file number and do
some more stuff with it, completely immune to the user shuffling files
around.

An example of this is the MacOS itself, which uses the CNID extensively to
aid in resolving alias files. With aliases (iirc), MacOS stores the parent
name and CNID, and the file name and CNID (and AppleShare server info and a
bunch of other stuff); then when it resolves the alias, it looks up each
combination of these attributes, in whatever order Apple figured would be
most reliable and least surprising to the user, in order to find the
original file. This makes alias files much more durable than a symlink
(pointing to a specific file/folder rather than just a pathname), and more
portable than a hard link (it can point to any file or folder across
volumes). Of course, it has neither the anonymity of a symlink nor the
absolute transparency of a hard link, but that's life.


(btw, my email is currently down for a bit pending some dns updates)

Cheers - Tony 'Nicoya' Mantler :)


--
Tony "Nicoya" Mantler - Renaissance Nerd Extraordinaire - nicoya@apia.dhs.org
Winnipeg, Manitoba, Canada           --           http://nicoya.feline.pp.se/


** Sent via the linuxppc-dev mail list. See http://lists.linuxppc.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2000-09-02  4:04 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-08-30 12:06 Btree directories (Re: Status of HFS+ support) Steve Lord
2000-08-30 15:07 ` Andi Kleen
2000-08-30 16:07   ` flar
2000-08-30 20:49     ` David A. Gatwood
2000-08-30 20:45       ` Steve Lord
2000-08-30 21:05       ` Alexander Viro
2000-08-30 21:51         ` David A. Gatwood
2000-08-30 21:49           ` Alexander Viro
2000-08-30 22:13             ` David A. Gatwood
2000-08-30 22:17               ` Alexander Viro
2000-08-30 23:11                 ` David A. Gatwood
2000-08-31  0:10                   ` Alexander Viro
2000-08-31  5:49       ` flar
2000-08-31 10:33         ` Alexander Viro
2000-08-31 17:48           ` flar
2000-08-31 19:54             ` Alexander Viro
2000-08-31 22:09               ` flar
2000-09-01  2:40                 ` Alexander Viro
2000-09-01  3:52                   ` flar
2000-09-02  4:04                     ` Tony Mantler
  -- strict thread matches above, loose matches on Subject: below --
2000-09-01 17:05 Halfmann, Klaus
2000-09-01 17:40 ` flar
2000-08-29 16:40 Status of HFS+ support (was hfs support for blocksize != 512) Halfmann, Klaus
2000-08-29 17:18 ` Btree directories (Re: Status of HFS+ support) Matthew Wilcox
2000-08-29 18:45   ` Steve Lord
2000-08-29 21:25     ` Matthew Wilcox
2000-08-30  2:22   ` flar
2000-08-30  8:35     ` Andi Kleen
2000-08-30 14:25   ` Chris Mason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).