linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Expected getdents behaviour
@ 2005-09-15 13:57 Akshat Aranya
  2005-09-15 14:03 ` Peter Staubach
  2005-09-15 14:07 ` Anton Altaparmakov
  0 siblings, 2 replies; 32+ messages in thread
From: Akshat Aranya @ 2005-09-15 13:57 UTC (permalink / raw)
  To: linux-fsdevel

I noticed that bonnie++, in its directory tests, does the following on
a large directory

open(dir);
while (getdents() != 0)
{
   unlink all the returned entries from getdents
}
close(dir);

My question is whether the filesystem's readdir is expected to
consider the offset value in the second readdir to still be valid,
given that entries from the directory were deleted after the first
readdir.

Akshat

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 13:57 Expected getdents behaviour Akshat Aranya
@ 2005-09-15 14:03 ` Peter Staubach
  2005-09-15 14:07 ` Anton Altaparmakov
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Staubach @ 2005-09-15 14:03 UTC (permalink / raw)
  To: Akshat Aranya; +Cc: linux-fsdevel

Akshat Aranya wrote:

>I noticed that bonnie++, in its directory tests, does the following on
>a large directory
>
>open(dir);
>while (getdents() != 0)
>{
>   unlink all the returned entries from getdents
>}
>close(dir);
>
>My question is whether the filesystem's readdir is expected to
>consider the offset value in the second readdir to still be valid,
>given that entries from the directory were deleted after the first
>readdir.
>

This "usually" works.  Most file systems do not directory compaction when
removing entries.  However, some do, so to be safe, the algorithm needs to
take into account this and also entries which can not be removed.

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 13:57 Expected getdents behaviour Akshat Aranya
  2005-09-15 14:03 ` Peter Staubach
@ 2005-09-15 14:07 ` Anton Altaparmakov
  2005-09-15 14:12   ` Anton Altaparmakov
  2005-09-16  1:28   ` tridge
  1 sibling, 2 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 14:07 UTC (permalink / raw)
  To: Akshat Aranya; +Cc: linux-fsdevel

On Thu, 2005-09-15 at 09:57 -0400, Akshat Aranya wrote:
> I noticed that bonnie++, in its directory tests, does the following on
> a large directory
> 
> open(dir);
> while (getdents() != 0)
> {
>    unlink all the returned entries from getdents
> }
> close(dir);
> 
> My question is whether the filesystem's readdir is expected to
> consider the offset value in the second readdir to still be valid,
> given that entries from the directory were deleted after the first
> readdir.

This would fail horribly on ntfs for example (once it supports file
deletion anyway) since the offset value of all entries changes when the
directory is modified.

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:07 ` Anton Altaparmakov
@ 2005-09-15 14:12   ` Anton Altaparmakov
  2005-09-15 14:45     ` Miklos Szeredi
                       ` (2 more replies)
  2005-09-16  1:28   ` tridge
  1 sibling, 3 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 14:12 UTC (permalink / raw)
  To: Akshat Aranya; +Cc: linux-fsdevel

On Thu, 2005-09-15 at 15:07 +0100, Anton Altaparmakov wrote:
> On Thu, 2005-09-15 at 09:57 -0400, Akshat Aranya wrote:
> > I noticed that bonnie++, in its directory tests, does the following on
> > a large directory
> > 
> > open(dir);
> > while (getdents() != 0)
> > {
> >    unlink all the returned entries from getdents
> > }
> > close(dir);
> > 
> > My question is whether the filesystem's readdir is expected to
> > consider the offset value in the second readdir to still be valid,
> > given that entries from the directory were deleted after the first
> > readdir.
> 
> This would fail horribly on ntfs for example (once it supports file
> deletion anyway) since the offset value of all entries changes when the
> directory is modified.

Oops.  I forgot to answer your question.  Yes, the filesystem needs to
consider the offset value in the second readdir to still be valid.  You
cannot keep rewinding back to zero every time you make a modification or
you would keep returning entries you have already returned and never
make any progress if e.g. some user does this in a loop at the same
time:

while 1; do
	touch blah
	rm blah
done

It would be a very effective DOS that any user can trigger.

Bonnie++'s code is just complete crap...  It is the author's fault that
it will not work on filesystems where the directory entries are not in
fixed locations...

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:12   ` Anton Altaparmakov
@ 2005-09-15 14:45     ` Miklos Szeredi
  2005-09-15 15:17       ` Anton Altaparmakov
  2005-09-15 15:51     ` Theodore Ts'o
  2005-09-15 18:08     ` Nikita Danilov
  2 siblings, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2005-09-15 14:45 UTC (permalink / raw)
  To: aia21; +Cc: aaranya, linux-fsdevel

> Oops.  I forgot to answer your question.  Yes, the filesystem needs to
> consider the offset value in the second readdir to still be valid.  You
> cannot keep rewinding back to zero every time you make a modification or
> you would keep returning entries you have already returned and never
> make any progress if e.g. some user does this in a loop at the same
> time:
> 
> while 1; do
> 	touch blah
> 	rm blah
> done
> 
> It would be a very effective DOS that any user can trigger.
> 
> Bonnie++'s code is just complete crap...  It is the author's fault that
> it will not work on filesystems where the directory entries are not in
> fixed locations...

Why is it crap?  'rm -rf' does exactly the same.

It's up to the filesystem to keep track of the file position in the
directory, regardless of any deletions or additions.

Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:45     ` Miklos Szeredi
@ 2005-09-15 15:17       ` Anton Altaparmakov
  2005-09-15 16:41         ` Jan Blunck
  2005-09-15 16:51         ` Miklos Szeredi
  0 siblings, 2 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 15:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: aaranya, linux-fsdevel

On Thu, 2005-09-15 at 16:45 +0200, Miklos Szeredi wrote:
> > Oops.  I forgot to answer your question.  Yes, the filesystem needs to
> > consider the offset value in the second readdir to still be valid.  You
> > cannot keep rewinding back to zero every time you make a modification or
> > you would keep returning entries you have already returned and never
> > make any progress if e.g. some user does this in a loop at the same
> > time:
> > 
> > while 1; do
> > 	touch blah
> > 	rm blah
> > done
> > 
> > It would be a very effective DOS that any user can trigger.
> > 
> > Bonnie++'s code is just complete crap...  It is the author's fault that
> > it will not work on filesystems where the directory entries are not in
> > fixed locations...
> 
> Why is it crap?  'rm -rf' does exactly the same.

Then that is broken, too.  But have you ever looked?  I just did and you
are partially wrong.  It especially handles the case correctly by doing
a "rewinddir()" on the directory before doing another readdir() after it
has done an unlike.  However you are right in that it only enables this
code at ./configure time so on most Linux systems will not have it
enabled in which case it will indeed be broken on filesystems like ntfs.

> It's up to the filesystem to keep track of the file position in the
> directory, regardless of any deletions or additions.

That is completely untrue.  It is up to the user.  The file position in
a directory is completely up to the caller and in particular when a
directory is modified the file position in that directory is not changed
by the filesystem.  Doing this would in fact be impossible since the
creation/deletion in the kernel has no access to all the "struct file"s
with which a directory has been opened.

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:12   ` Anton Altaparmakov
  2005-09-15 14:45     ` Miklos Szeredi
@ 2005-09-15 15:51     ` Theodore Ts'o
  2005-09-15 16:52       ` Bryan Henderson
  2005-09-15 20:25       ` Anton Altaparmakov
  2005-09-15 18:08     ` Nikita Danilov
  2 siblings, 2 replies; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-15 15:51 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel

On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote:
> Oops.  I forgot to answer your question.  Yes, the filesystem needs to
> consider the offset value in the second readdir to still be valid.  You
> cannot keep rewinding back to zero every time you make a modification or
> you would keep returning entries you have already returned and never
> make any progress if e.g. some user does this in a loop at the same
> time:

POSIX (or SUSv3) does not guarantee the offset data structure to be
the dirent structure at all.  So a portable application should not
count of d_off on being present.

That being said, it *is* fair game to assume that an application
should be able to call readdir() repeatedly and get all files in the
directory once and exactly once, even if another process is unlinking
files or adding files while the readdir is going on.  The only thing
which is unspecified is whether a file which is deleted or added after
the application has started iterating over the directory will be
included or not.  (Think about it; Unix is a multi-user, time-sharing
system.  Nothing else makes sense, since otherwise programs that used
readdir() would randomly break if a directory is modified by another
process at the same time.)

In fact, POSIX requires that telldir() and seekdir() do the right
thing even if directory entries are added or deleted between the
telldir() and seekdir().  Yes, this is hard on directories which use
something more sophisticated a simple linked list to store their
directory entries (like a b-tree, for example).  However, it is
required by POSIX/SUSv3.  The JFS filesystem, for example, uses an
entirely separate b-tree just to guarantee telldir() and seekdir()
indexes behave properly in the presence of file inserts and removals.

> Bonnie++'s code is just complete crap...  It is the author's fault that
> it will not work on filesystems where the directory entries are not in
> fixed locations...

If Bonnie++ is relying on d_off, then yet.  But in fact, if Bonniee++
is just doing a series of readdir()'s, and the filesystem doesn't do
the right thing in the face of concurrent deletes or file creates, it
is in fact the filesystem which is broken.  It doesn't matter if the
filesystem is using a sophisticated b-tree data structure; it still
has to do the right thing.  There is a lot of hair in ext3, jfs, xfs,
reiserfs, etc. in order to guarantee this to be the case, since it is
expected by Unix applications, and it is required by the standards
specifications.

(I often curse the POSIX specifiers for including telldir/seekdir into
the standards, since it's hell to support, but it's there, and there
are applications which rely on it --- unfortunately.)

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 15:17       ` Anton Altaparmakov
@ 2005-09-15 16:41         ` Jan Blunck
  2005-09-15 17:46           ` Jörn Engel
  2005-09-15 20:28           ` Anton Altaparmakov
  2005-09-15 16:51         ` Miklos Szeredi
  1 sibling, 2 replies; 32+ messages in thread
From: Jan Blunck @ 2005-09-15 16:41 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: miklos, aaranya, linux-fsdevel

On Thu, Sep 15, Anton Altaparmakov wrote:

> 
> That is completely untrue.  It is up to the user.  The file position in
> a directory is completely up to the caller and in particular when a
> directory is modified the file position in that directory is not changed
> by the filesystem.  Doing this would in fact be impossible since the
> creation/deletion in the kernel has no access to all the "struct file"s
> with which a directory has been opened.
> 

No, thats not true. The file position in a regular (flat) file is completely
up to the user. "Thanks" to the short-comings of POSIX on not defining the
functions on directories, the file position in a directory is completely up to
the file system. Due to that fact, modern file systems don't need to provide
a "flat" view (in terms of sequential access to a directory file) on
them. Although, the file system must guarantee, that readdir's must not fail
because of changes directory contents. AFAIR ext2 is achieving this through
not shrinking directories while the filesystem is online.

Regards,
	Jan

-- 
Jan Blunck                                               jblunck@suse.de
SuSE LINUX AG - A Novell company
Maxfeldstr. 5                                          +49-911-74053-608
D-90409 Nürnberg                                      http://www.suse.de
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 15:17       ` Anton Altaparmakov
  2005-09-15 16:41         ` Jan Blunck
@ 2005-09-15 16:51         ` Miklos Szeredi
  2005-09-15 21:17           ` Anton Altaparmakov
  1 sibling, 1 reply; 32+ messages in thread
From: Miklos Szeredi @ 2005-09-15 16:51 UTC (permalink / raw)
  To: aia21; +Cc: miklos, aaranya, linux-fsdevel

> Then that is broken, too.  But have you ever looked?

I looked at the strace yes.

> I just did and you are partially wrong.  It especially handles the
> case correctly by doing a "rewinddir()" on the directory before
> doing another readdir() after it has done an unlike.  However you
> are right in that it only enables this code at ./configure time so
> on most Linux systems will not have it enabled in which case it will
> indeed be broken on filesystems like ntfs.

Well, having a configure option is not very consoling for the average
luser faced with a non-working rm binary.

> That is completely untrue.  It is up to the user.

I think the normal behavior for a filesystem is that for each entry
which existed at the call to opendir, and continues to exist until the
whole directory is read, will be returned exactly once.  For any other
case (entry added or removed during the directory reading) it is
unspecified whether the entry is returned once or not returned.

SUS doesn't say this in so many words, but there are hints:

  The type DIR, which is defined in the <dirent.h> header, represents
  a directory stream, which is an ordered sequence of all the
  directory entries in a particular directory. Directory entries
  represent files; files may be removed from a directory or added to a
  directory asynchronously to the operation of readdir().

  [...]

  If a file is removed from or added to the directory after the most
  recent call to opendir() or rewinddir(), whether a subsequent call
  to readdir() returns an entry for that file is unspecified.

> The file position in a directory is completely up to the caller and
> in particular when a directory is modified the file position in that
> directory is not changed by the filesystem.  Doing this would in
> fact be impossible since the creation/deletion in the kernel has no
> access to all the "struct file"s with which a directory has been
> opened.

And does not need to.  The only thing the filesystem must do is to
guarantee, that IF the file position wasn't changed since the last
getdents() call (i.e. no lseek(), or zero relative offset lseek() was
done), the directory reading should continue from the place where it
has been left off.

Look at dcache_readdir() in fs/libfs.c for a way to deal with exactly
this problem.

Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 15:51     ` Theodore Ts'o
@ 2005-09-15 16:52       ` Bryan Henderson
  2005-09-15 16:57         ` Jeremy Allison
  2005-09-15 20:50         ` Anton Altaparmakov
  2005-09-15 20:25       ` Anton Altaparmakov
  1 sibling, 2 replies; 32+ messages in thread
From: Bryan Henderson @ 2005-09-15 16:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Akshat Aranya, Anton Altaparmakov, linux-fsdevel

Since others have opined differently, I'd like to add my voice to Ted's 
and say his explanation is exactly my understanding too; I couldn't have 
said it any better, right down to the part of cursing our reality.

We hate to admit that this is the requirement of directory positions, 
because it's so hard to implement, but it's what is expected.  When I 
explain the directory positioning requirement (which actually shows up in 
lots of places in software design besides filesystem directories), I 
always say the ideal is to think of the directory position as a white card 
within a deck of playing cards.  When you say "tell me about the next 
card," you get the card directly on top of the white card and that card 
then moves under the white card.  People can insert and delete cards all 
around the white card, but the sequence is still well defined.

In actuality, it's acceptable to fall short of that ideal and ignore some 
or all insertions or deletions that happened after a pass started, and 
many implementations do.  But what you'd have a really hard time getting 
accepted is returning the same entry twice or skipping an entry that was 
always there, as someone steps through the directory.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:52       ` Bryan Henderson
@ 2005-09-15 16:57         ` Jeremy Allison
  2005-09-15 20:51           ` Anton Altaparmakov
  2005-09-15 20:50         ` Anton Altaparmakov
  1 sibling, 1 reply; 32+ messages in thread
From: Jeremy Allison @ 2005-09-15 16:57 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: Theodore Ts'o, Akshat Aranya, Anton Altaparmakov,
	linux-fsdevel

On Thu, Sep 15, 2005 at 09:52:33AM -0700, Bryan Henderson wrote:
> Since others have opined differently, I'd like to add my voice to Ted's 
> and say his explanation is exactly my understanding too; I couldn't have 
> said it any better, right down to the part of cursing our reality.

Well the reality is there are applications out there that
create more than 1,000,000 files in a directory, and in
order to support that efficiently having a working
seekdir()/telldir() is a *MUST*.

Jeremy.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:41         ` Jan Blunck
@ 2005-09-15 17:46           ` Jörn Engel
  2005-09-15 18:19             ` Theodore Ts'o
  2005-09-15 21:00             ` Anton Altaparmakov
  2005-09-15 20:28           ` Anton Altaparmakov
  1 sibling, 2 replies; 32+ messages in thread
From: Jörn Engel @ 2005-09-15 17:46 UTC (permalink / raw)
  To: Jan Blunck; +Cc: Anton Altaparmakov, miklos, aaranya, linux-fsdevel

On Thu, 15 September 2005 18:41:10 +0200, Jan Blunck wrote:
> 
> No, thats not true. The file position in a regular (flat) file is completely
> up to the user. "Thanks" to the short-comings of POSIX on not defining the
> functions on directories, the file position in a directory is completely up to
> the file system. Due to that fact, modern file systems don't need to provide
> a "flat" view (in terms of sequential access to a directory file) on
> them. Although, the file system must guarantee, that readdir's must not fail
> because of changes directory contents. AFAIR ext2 is achieving this through
> not shrinking directories while the filesystem is online.

Correct.  But you're missing half of the problem.

The other half is getting O(1)  or O(ln(n)) performance for lookup on
a large directory.  Solution to this half alone would be to use a
balanced tree for directory entries.  Now, using a self-balancing tree
would not solve your half of the problem, so it is not a full
solution.

Ext2, as far as I understand it, solved both halves by having both a
hashtree and the sequential list.  Getdents is walking the list and
ignoring entries with the "deleted" flag.  Lookup is using the
hashtree, which points to the list for actual data.

Jörn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:12   ` Anton Altaparmakov
  2005-09-15 14:45     ` Miklos Szeredi
  2005-09-15 15:51     ` Theodore Ts'o
@ 2005-09-15 18:08     ` Nikita Danilov
  2005-09-16 11:23       ` Miklos Szeredi
  2 siblings, 1 reply; 32+ messages in thread
From: Nikita Danilov @ 2005-09-15 18:08 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: linux-fsdevel

Anton Altaparmakov writes:

[...]

 > 
 > Bonnie++'s code is just complete crap...  It is the author's fault that
 > it will not work on filesystems where the directory entries are not in
 > fixed locations...

Unfortunately Single UNIX Specification states that "offset"
(telldir(3)'s result) is only invalidated by calls to rewinddir(3).

It is true that for certain (heh, almost _all_ non-s5fs-descending) file
systems it is hard to maintain directory offsets stable in the face on
concurrent directory modifications, but that's what current standards
require and what user level software relies upon.

 > 
 > Best regards,
 > 
 >         Anton

Nikita.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 17:46           ` Jörn Engel
@ 2005-09-15 18:19             ` Theodore Ts'o
  2005-09-15 21:04               ` Anton Altaparmakov
                                 ` (2 more replies)
  2005-09-15 21:00             ` Anton Altaparmakov
  1 sibling, 3 replies; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-15 18:19 UTC (permalink / raw)
  To: J?rn Engel; +Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel

On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
> Correct.  But you're missing half of the problem.
> 
> The other half is getting O(1)  or O(ln(n)) performance for lookup on
> a large directory.  Solution to this half alone would be to use a
> balanced tree for directory entries.  Now, using a self-balancing tree
> would not solve your half of the problem, so it is not a full
> solution.
> 
> Ext2, as far as I understand it, solved both halves by having both a
> hashtree and the sequential list.  Getdents is walking the list and
> ignoring entries with the "deleted" flag.  Lookup is using the
> hashtree, which points to the list for actual data.

Actually, no.  What we do is return the directory entries in hash sort
order, by walking the btree.  And we use the hash as the offset
returned in d_off and via telldir().  The reason for this?  So that if
we add files which causes a node to split, that we still only return
files in the directory once and only once.  If we traversed the tree
in sequential order only, then some files could be returned multiple
times, and some files returned not at all after a node split
operation.

This is part of the "hair" that is needed in order to have a
POSIX-compliant (and non-buggy) readdir/telldir implementation.  The
downside of returning files in hash sort order is that an application
which tries to stat or open every single file in a directory in
readdir() order will no longer do so in rough creation order, and this
can cause the disk heads to seek all over the inode table to read the
inode.  This performance loss can fixed by sorting the files by inode
number before opening them.  This can be done in the application, or
via a LD_PRELOAD library, which reads the entire directory at
opendir() time, and then sorts the directory and returns it in sorted
inode order.  (Why can't we do this in the kernel? because for a very
large directory, we would have to allocate a huge amount of
non-swappable memory to store all of the filenames in the directory.)

As I said, implementing POSIX-compliant
opendir/readdir/telldir/seekdir is tricky for some filesystems, and
filesystem engineers often curse POSIX telldir/seekdir.  But they are
a fact of life, and we have to deal with them.

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 15:51     ` Theodore Ts'o
  2005-09-15 16:52       ` Bryan Henderson
@ 2005-09-15 20:25       ` Anton Altaparmakov
  2005-09-16  3:39         ` Theodore Ts'o
  1 sibling, 1 reply; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 20:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Akshat Aranya, linux-fsdevel

On Thu, 15 Sep 2005, Theodore Ts'o wrote:
> On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote:
> > Oops.  I forgot to answer your question.  Yes, the filesystem needs to
> > consider the offset value in the second readdir to still be valid.  You
> > cannot keep rewinding back to zero every time you make a modification or
> > you would keep returning entries you have already returned and never
> > make any progress if e.g. some user does this in a loop at the same
> > time:
> 
> POSIX (or SUSv3) does not guarantee the offset data structure to be
> the dirent structure at all.  So a portable application should not
> count of d_off on being present.

Why should f_pos be the dirent structure?!?  That would be completely 
insane...

> That being said, it *is* fair game to assume that an application
> should be able to call readdir() repeatedly and get all files in the
> directory once and exactly once, even if another process is unlinking
> files or adding files while the readdir is going on.  The only thing

I disagree.  readdir() is a completely brain damaged interface and it is 
not fair game to assume that at all...

> which is unspecified is whether a file which is deleted or added after
> the application has started iterating over the directory will be
> included or not.  (Think about it; Unix is a multi-user, time-sharing
> system.  Nothing else makes sense, since otherwise programs that used
> readdir() would randomly break if a directory is modified by another
> process at the same time.)

Only if they are written badly!  Also it depends what you mean by break.  
For example "while (i = readdir); do rm i; done" would not break, it would 
simply miss some files.  It would not produce an error.

If it were properly written when it is finished it would check if there 
are still things to delete and start again if so and keep looping until 
there are none left.

Anything else _cannot_ work unless opendir() results in a read_lock on the 
directory and it is only unlocked on close.  Nothing else is sane and will 
result in a trivial DOS by any user on the system where a fs has to play 
tricks.

> In fact, POSIX requires that telldir() and seekdir() do the right
> thing even if directory entries are added or deleted between the
> telldir() and seekdir().  Yes, this is hard on directories which use

Sorry what do you mean?  They will and can work fine.  You use telldir to 
give you and offset (f_pos) and seekdir puts the offset into f_pos.  
Nothing more nothing less.  If you have removed files or added files in 
between two readdir calls (irrelevant whether you used seek/telldir) the 
f_pos will just now point in the wrong place and you will get some entries 
duplicated or you will miss some because you did not rewind back to 0 
after the change and because the directory was not locked against 
modifications.

> something more sophisticated a simple linked list to store their
> directory entries (like a b-tree, for example).  However, it is

Yes, ntfs uses a B tree.

> required by POSIX/SUSv3.  The JFS filesystem, for example, uses an

Er, have you read it?  To quote from "IEEE Std 1003.1, 2004 Edition", 
seekdir, from the informative "rationale" section:

<quote>
The original standard developers perceived that there were restrictions on 
the use of the seekdir() and telldir() functions related to implementation 
details, and for that reason these functions need not be supported on all 
POSIX-conforming systems. They are required on implementations supporting 
the XSI extension.

One of the perceived problems of implementation is that returning to a 
given point in a directory is quite difficult to describe formally, in 
spite of its intuitive appeal, when systems that use B-trees, hashing 
functions, or other similar mechanisms to order their directories are 
considered. The definition of seekdir() and telldir() does not specify 
whether, when using these interfaces, a given directory entry will be seen 
at all, or more than once.

On systems not supporting these functions, their capability can sometimes 
be accomplished by saving a filename found by readdir() and later using 
rewinddir() and a loop on readdir() to relocate the position from which 
the filename was saved.
</quote>

Thus any application relying on f_pos in a directory to be meaningful is 
broken by design and even POSIX says so.  Heck seek/telldir is not even 
required in POSIX unless you implement he XSI extension (I admit I have 
no idea what XSI is)!  (It says so above...)

> entirely separate b-tree just to guarantee telldir() and seekdir()
> indexes behave properly in the presence of file inserts and removals.

So any user can cause DOS/OOM by doing a: "while 1; do opendir(); 
readdir(); done" on a really big directory (note how I am never closing 
the directory)...  What a fantastic filesystem that is!  All the sheep are 
jumping off the bridge, lets jump, too!  I think not...

> > Bonnie++'s code is just complete crap...  It is the author's fault that
> > it will not work on filesystems where the directory entries are not in
> > fixed locations...
> 
> If Bonnie++ is relying on d_off, then yet.  But in fact, if Bonniee++
> is just doing a series of readdir()'s, and the filesystem doesn't do
> the right thing in the face of concurrent deletes or file creates, it
> is in fact the filesystem which is broken.  It doesn't matter if the

I still disagree.  The standards are broken if they require that from 
readdir().  Obviously at least someone understands given the seekdir() 
description.

> filesystem is using a sophisticated b-tree data structure; it still
> has to do the right thing.  There is a lot of hair in ext3, jfs, xfs,
> reiserfs, etc. in order to guarantee this to be the case, since it is
> expected by Unix applications, and it is required by the standards
> specifications.
>
> (I often curse the POSIX specifiers for including telldir/seekdir into
> the standards, since it's hell to support, but it's there, and there
> are applications which rely on it --- unfortunately.)

seekdir()/telldir() are no problem as they are meaningless and POSIX 
agrees.

readdir() is the problem.  It is _impossible_ to do what POSIX demands 
using readdir without some form of lock to say "directory cannot be 
modified".  Or if not a lock then a snapshot.  That is exactly what it is 
asking for!  I guess that would be the only way to support it.  Snapshot 
the directory and internally queue all modifications or apply them using 
COW.  But the problem even then is hwo do you know when the user has 
finished calling readdir().  There is no guarantee they will keep going 
till EOD is reached.  There is not even any guarantee the user will close 
the directory they opened.  Again, this would be a DOS and cause OOM in no 
time on a huge directory.

Maybe I am missing something...  How would you suggest to work around the 
above described problems?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:41         ` Jan Blunck
  2005-09-15 17:46           ` Jörn Engel
@ 2005-09-15 20:28           ` Anton Altaparmakov
  1 sibling, 0 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 20:28 UTC (permalink / raw)
  To: Jan Blunck; +Cc: miklos, aaranya, linux-fsdevel

On Thu, 15 Sep 2005, Jan Blunck wrote:
> On Thu, Sep 15, Anton Altaparmakov wrote:
> > That is completely untrue.  It is up to the user.  The file position in
> > a directory is completely up to the caller and in particular when a
> > directory is modified the file position in that directory is not changed
> > by the filesystem.  Doing this would in fact be impossible since the
> > creation/deletion in the kernel has no access to all the "struct file"s
> > with which a directory has been opened.
> 
> No, thats not true. The file position in a regular (flat) file is completely
> up to the user. "Thanks" to the short-comings of POSIX on not defining the
> functions on directories, the file position in a directory is completely up to
> the file system. Due to that fact, modern file systems don't need to provide
> a "flat" view (in terms of sequential access to a directory file) on

We are talking past each other and actually agreeing with each other.  Our 
definitions of "up to the user" and "up to the fs" are at opposites.

> them. Although, the file system must guarantee, that readdir's must not fail
> because of changes directory contents. AFAIR ext2 is achieving this through
> not shrinking directories while the filesystem is online.

ntfs uses B trees on disk for directories.  The tree must be balanced at 
all times and each insert/delete will be a balancing tree operation.  
This can completely change the tree shape and hence the on disk data.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:52       ` Bryan Henderson
  2005-09-15 16:57         ` Jeremy Allison
@ 2005-09-15 20:50         ` Anton Altaparmakov
  2005-09-15 23:41           ` Bryan Henderson
  1 sibling, 1 reply; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 20:50 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: Theodore Ts'o, Akshat Aranya, linux-fsdevel

On Thu, 15 Sep 2005, Bryan Henderson wrote:
> Since others have opined differently, I'd like to add my voice to Ted's 
> and say his explanation is exactly my understanding too; I couldn't have 
> said it any better, right down to the part of cursing our reality.
> 
> We hate to admit that this is the requirement of directory positions, 
> because it's so hard to implement, but it's what is expected.  When I 
> explain the directory positioning requirement (which actually shows up in 
> lots of places in software design besides filesystem directories), I 
> always say the ideal is to think of the directory position as a white card 
> within a deck of playing cards.  When you say "tell me about the next 
> card," you get the card directly on top of the white card and that card 
> then moves under the white card.  People can insert and delete cards all 
> around the white card, but the sequence is still well defined.
> 
> In actuality, it's acceptable to fall short of that ideal and ignore some 
> or all insertions or deletions that happened after a pass started, and 
> many implementations do.  But what you'd have a really hard time getting 
> accepted is returning the same entry twice or skipping an entry that was 
> always there, as someone steps through the directory.

That may be so but AFAICS it is impossible to implement without causing a 
user triggerable OOM/DOS when you have a filesystem where directory 
entries are static but are a B tree for example.  The problem comes from 
the fact that there is no way to tell for how long that white card is 
valid to use your analogy.  You would have to create one the first time 
you see a readdir() from a particular file descriptor on the directory.  
But when are you free to get rid of it?  There is no way to tell.  When 
the directory is closed I hear you say, but what if it never gets closed 
because the user is malicious?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:57         ` Jeremy Allison
@ 2005-09-15 20:51           ` Anton Altaparmakov
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 20:51 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Bryan Henderson, Theodore Ts'o, Akshat Aranya, linux-fsdevel

On Thu, 15 Sep 2005, Jeremy Allison wrote:
> On Thu, Sep 15, 2005 at 09:52:33AM -0700, Bryan Henderson wrote:
> > Since others have opined differently, I'd like to add my voice to Ted's 
> > and say his explanation is exactly my understanding too; I couldn't have 
> > said it any better, right down to the part of cursing our reality.
> 
> Well the reality is there are applications out there that
> create more than 1,000,000 files in a directory, and in
> order to support that efficiently having a working
> seekdir()/telldir() is a *MUST*.

And exactly because of such large directories you cannot provide either a 
readdir or seekdir/telldir without some kind of locking on the dircetory 
you are working with.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 17:46           ` Jörn Engel
  2005-09-15 18:19             ` Theodore Ts'o
@ 2005-09-15 21:00             ` Anton Altaparmakov
  2005-09-15 21:15               ` Charles P. Wright
  1 sibling, 1 reply; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 21:00 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Jan Blunck, miklos, aaranya, linux-fsdevel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3104 bytes --]

On Thu, 15 Sep 2005, [iso-8859-1] Jörn Engel wrote:

> On Thu, 15 September 2005 18:41:10 +0200, Jan Blunck wrote:
> > 
> > No, thats not true. The file position in a regular (flat) file is completely
> > up to the user. "Thanks" to the short-comings of POSIX on not defining the
> > functions on directories, the file position in a directory is completely up to
> > the file system. Due to that fact, modern file systems don't need to provide
> > a "flat" view (in terms of sequential access to a directory file) on
> > them. Although, the file system must guarantee, that readdir's must not fail
> > because of changes directory contents. AFAIR ext2 is achieving this through
> > not shrinking directories while the filesystem is online.
> 
> Correct.  But you're missing half of the problem.
> 
> The other half is getting O(1)  or O(ln(n)) performance for lookup on
> a large directory.  Solution to this half alone would be to use a
> balanced tree for directory entries.  Now, using a self-balancing tree
> would not solve your half of the problem, so it is not a full
> solution.
> 
> Ext2, as far as I understand it, solved both halves by having both a
> hashtree and the sequential list.  Getdents is walking the list and
> ignoring entries with the "deleted" flag.  Lookup is using the
> hashtree, which points to the list for actual data.

That is not too far from what ntfs does, too.  It is a b tree (i.e. 
analogous to the ext2 hashtree) but on disk it is a flat file (analogous 
to the ext2 sequential list).  The problem comes when you add or delete an 
entry the order of everything in the flat file is suddenly changed 
(compare it to taking the linked list apart and putting it back together 
in a different order).  ntfs_readdir() instead of walking the b tree, 
simply reads the linear file and returns the entries in the order it finds 
them on disk.  This is very efficient as it involves no seeking around, no 
tree walking, or anything like that.  So there is no way to tell what the 
next entry should have been before the removal/addition as it may now be a 
completely different one.

After all this cursing at the standards I think I may have found a way but 
it cuases readdir to be very inefficient.  It might work to walk the b 
tree instead of just returning entries from the flat directory file data 
and to take a reference on the directory entry corresponding to f_pos so 
it cannot go away and then when readdir is called again resume from there 
in the b tree.  The only problem again is when the reference to the 
directory entry will be released in the case that no further readdir and 
no close ever comes.  It causes the fs to be unmountable for example 
and the file belonging to the directory entry to be undeletable...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 18:19             ` Theodore Ts'o
@ 2005-09-15 21:04               ` Anton Altaparmakov
  2005-09-16  7:50                 ` Nikita Danilov
  2005-09-15 21:47               ` Jörn Engel
  2005-09-16  7:29               ` Nikita Danilov
  2 siblings, 1 reply; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 21:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: J?rn Engel, Jan Blunck, miklos, aaranya, linux-fsdevel

On Thu, 15 Sep 2005, Theodore Ts'o wrote:
> On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
> > Correct.  But you're missing half of the problem.
> > 
> > The other half is getting O(1)  or O(ln(n)) performance for lookup on
> > a large directory.  Solution to this half alone would be to use a
> > balanced tree for directory entries.  Now, using a self-balancing tree
> > would not solve your half of the problem, so it is not a full
> > solution.
> > 
> > Ext2, as far as I understand it, solved both halves by having both a
> > hashtree and the sequential list.  Getdents is walking the list and
> > ignoring entries with the "deleted" flag.  Lookup is using the
> > hashtree, which points to the list for actual data.
> 
> Actually, no.  What we do is return the directory entries in hash sort
> order, by walking the btree.  And we use the hash as the offset
> returned in d_off and via telldir().  The reason for this?  So that if
> we add files which causes a node to split, that we still only return
> files in the directory once and only once.  If we traversed the tree
> in sequential order only, then some files could be returned multiple
> times, and some files returned not at all after a node split
> operation.

I don't know how your hash works but on ntfs this does not work like that.  
There is no parameter which would be able to tell you where you are in the 
b tree, unless you take a reference on an entry so it cannot be removed 
(rename==remove btw) but how does the reference get released later?

> This is part of the "hair" that is needed in order to have a
> POSIX-compliant (and non-buggy) readdir/telldir implementation.  The
> downside of returning files in hash sort order is that an application
> which tries to stat or open every single file in a directory in
> readdir() order will no longer do so in rough creation order, and this
> can cause the disk heads to seek all over the inode table to read the
> inode.  This performance loss can fixed by sorting the files by inode
> number before opening them.  This can be done in the application, or
> via a LD_PRELOAD library, which reads the entire directory at
> opendir() time, and then sorts the directory and returns it in sorted
> inode order.  (Why can't we do this in the kernel? because for a very
> large directory, we would have to allocate a huge amount of
> non-swappable memory to store all of the filenames in the directory.)
> 
> As I said, implementing POSIX-compliant
> opendir/readdir/telldir/seekdir is tricky for some filesystems, and
> filesystem engineers often curse POSIX telldir/seekdir.  But they are
> a fact of life, and we have to deal with them.

Sure but given a large enough directory you will OOM the system rather 
than return the correct result which is why I keep saying the interface is 
broken...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 21:00             ` Anton Altaparmakov
@ 2005-09-15 21:15               ` Charles P. Wright
  2005-09-15 21:19                 ` Anton Altaparmakov
  0 siblings, 1 reply; 32+ messages in thread
From: Charles P. Wright @ 2005-09-15 21:15 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: aaranya, linux-fsdevel

On Thu, 2005-09-15 at 22:00 +0100, Anton Altaparmakov wrote:
> The only problem again is when the reference to the 
> directory entry will be released in the case that no further readdir and 
> no close ever comes.  It causes the fs to be unmountable for example 
> and the file belonging to the directory entry to be undeletable...
Anton,

If no further readdir "ever" comes, and the file is never closed, the
file system will not be unmountable to begin with.  It is just like
opening and never closing a normal file, you keep the file system busy.

I don't see the connection between the extra per-open-file state you
need to save, and the argument that this opens up up for a DoS.

Charles


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 16:51         ` Miklos Szeredi
@ 2005-09-15 21:17           ` Anton Altaparmakov
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 21:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: aaranya, linux-fsdevel

On Thu, 15 Sep 2005, Miklos Szeredi wrote:
> > The file position in a directory is completely up to the caller and
> > in particular when a directory is modified the file position in that
> > directory is not changed by the filesystem.  Doing this would in
> > fact be impossible since the creation/deletion in the kernel has no
> > access to all the "struct file"s with which a directory has been
> > opened.
> 
> And does not need to.  The only thing the filesystem must do is to
> guarantee, that IF the file position wasn't changed since the last
> getdents() call (i.e. no lseek(), or zero relative offset lseek() was
> done), the directory reading should continue from the place where it
> has been left off.

But how do you define "the place where it has been left off"?  What you 
are saying is hand waving.  No actual meaning behind those words.  
Example:

f_pos = logical byte offset inside the linear on-disk directory stream.  
(This is what ntfs uses.)

So if you do not change f_pos, the next readdir() will continue at the 
same logical byte offset inside the linear on-disk directory stream, i.e. 
it will continue at "the place where it has been left off".  Which is what 
you specified!  Notice the problem?  There is no guarantee whatsoever that 
after a rm or create, this offset inside the on-disk directory stream 
points to the same directory entry any more since the entries will have 
been moved around.  Even worse the very entry it was pointing to may have 
been removed altogether so its place will have been filled by a different 
entry.

> Look at dcache_readdir() in fs/libfs.c for a way to deal with exactly
> this problem.

AFAICS this only works because it is a linked list.  But I am rather tired 
now so I may well be missing something.  That function is not very easy to 
understand...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 21:15               ` Charles P. Wright
@ 2005-09-15 21:19                 ` Anton Altaparmakov
  0 siblings, 0 replies; 32+ messages in thread
From: Anton Altaparmakov @ 2005-09-15 21:19 UTC (permalink / raw)
  To: Charles P. Wright; +Cc: aaranya, linux-fsdevel

On Thu, 15 Sep 2005, Charles P. Wright wrote:
> On Thu, 2005-09-15 at 22:00 +0100, Anton Altaparmakov wrote:
> > The only problem again is when the reference to the 
> > directory entry will be released in the case that no further readdir and 
> > no close ever comes.  It causes the fs to be unmountable for example 
> > and the file belonging to the directory entry to be undeletable...
> Anton,
> 
> If no further readdir "ever" comes, and the file is never closed, the
> file system will not be unmountable to begin with.  It is just like
> opening and never closing a normal file, you keep the file system busy.

Er, yes, that is true.
 
> I don't see the connection between the extra per-open-file state you
> need to save, and the argument that this opens up up for a DoS.

When I say DoS I mean "user triggerable OOM" on a large dirctory.  Say you 
have a directory with 2^30 files in it and you need to snapshot it or 
something each time a user does opendir() + readir().  You would OOM very 
quickly...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 18:19             ` Theodore Ts'o
  2005-09-15 21:04               ` Anton Altaparmakov
@ 2005-09-15 21:47               ` Jörn Engel
  2005-09-16  7:29               ` Nikita Danilov
  2 siblings, 0 replies; 32+ messages in thread
From: Jörn Engel @ 2005-09-15 21:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel

On Thu, 15 September 2005 14:19:47 -0400, Theodore Ts'o wrote:
> On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
> > Correct.  But you're missing half of the problem.
> > 
> > The other half is getting O(1)  or O(ln(n)) performance for lookup on
> > a large directory.  Solution to this half alone would be to use a
> > balanced tree for directory entries.  Now, using a self-balancing tree
> > would not solve your half of the problem, so it is not a full
> > solution.
> > 
> > Ext2, as far as I understand it, solved both halves by having both a
> > hashtree and the sequential list.  Getdents is walking the list and
> > ignoring entries with the "deleted" flag.  Lookup is using the
> > hashtree, which points to the list for actual data.
> 
> Actually, no.  What we do is return the directory entries in hash sort
> order, by walking the btree.  And we use the hash as the offset
> returned in d_off and via telldir().  The reason for this?  So that if
> we add files which causes a node to split, that we still only return
> files in the directory once and only once.  If we traversed the tree
> in sequential order only, then some files could be returned multiple
> times, and some files returned not at all after a node split
> operation.

You are right.  I understood what ext2 was doing, saw a huge problem
in lookup performance (O(n)) and was given a very brief explanation of
hashtrees - which apparently did not match reality.

Still wouldn't a hybrid design with both a list and a tree work?  It
appears to be just a different style of hair, not any prettier and
just as sufficient to cover all bald spots.

Jörn

-- 
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 20:50         ` Anton Altaparmakov
@ 2005-09-15 23:41           ` Bryan Henderson
  0 siblings, 0 replies; 32+ messages in thread
From: Bryan Henderson @ 2005-09-15 23:41 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Akshat Aranya, aia21, linux-fsdevel, Theodore Ts'o

>The problem comes from 
>the fact that there is no way to tell for how long that white card is 
>valid to use your analogy.  You would have to create one the first time 
>you see a readdir() from a particular file descriptor on the directory. 
>But when are you free to get rid of it?

So it's analogous to the file pointer for a file - as long as you have a 
file open, you're occupying kernel memory because the kernel has to 
remember where you last read so it knows where to read next.  That's why 
systems have resource limits on things like number of open files per 
process, number of processes per system, etc.  They don't actually close 
the hole, but they're apparently good enough.

It could still be an issue in Linux implementation, though.  ISTR the 
filesystem drivers don't see the directory read sessions, which means they 
can't keep private stream state, so that the only information they can use 
to find a place in a directory sequence is that little position integer. 
People have found ways to make that enough information to fully describe 
the white card, but I know it isn't easy.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 14:07 ` Anton Altaparmakov
  2005-09-15 14:12   ` Anton Altaparmakov
@ 2005-09-16  1:28   ` tridge
  1 sibling, 0 replies; 32+ messages in thread
From: tridge @ 2005-09-16  1:28 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel

At the risk of thowing a spanner in the works, Samba relies pretty
heavily on seekdir()/telldir()/unlink() interacting in "nice"
ways. 

The most problematic case is OS/2 clients, which use a bizarre pattern
for removing all files in a directory. I've coded up the equivalent
set of posix directory operations for a OS/2 client deleting a
directory in:

  http://samba.org/ftp/unpacked/junkcode/os2_delete.c

what they do is:

 1) open directory
 2) read 100 entries, asking for the 'telldir' location of each
 3) delete the first 4 of those entries
 4) seekdir to the offset returned by the 4th entry (now deleted)
 5) read another 100 entries
 6) go to step 3

this continues until the read of the directory returns no entries. At
this point the client assumes the directory is empty.

This works fine on Linux. On FreeBSD systems the way libc telldir()
works means that the above pattern uses a huge amount of memory, and
ends up not deleting all entries in the directory, so we ended up
replacing telldir() on FreeBSD inside Samba4. See:

  http://samba.org/ftp/unpacked/samba4/source/lib/replace/repdir/

So if this discussion leads to any changes in getdents(), please don't
use the FreeBSD method (it allocates a linked list of all telldir()
results, and doesn't correctly cope with deletions during a directory
scan). 

Cheers, Tridge

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 20:25       ` Anton Altaparmakov
@ 2005-09-16  3:39         ` Theodore Ts'o
  2005-09-16 11:57           ` Dave Kleikamp
  0 siblings, 1 reply; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-16  3:39 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel

On Thu, Sep 15, 2005 at 09:25:55PM +0100, Anton Altaparmakov wrote:
> > POSIX (or SUSv3) does not guarantee the offset data structure to be
> > the dirent structure at all.  So a portable application should not
> > count of d_off on being present.
> 
> Why should f_pos be the dirent structure?!?  That would be completely 
> insane...

The only fields that are guaranted to be in the dirent structure by
the POSIX specification are:

	ino_t	d_ino;
	char	d_name[];

No other fields must be provided by a POSIX-compliant implementation.
Hence, a portable application must not use d_off.  It may not be
present on all POSIX-compliant operating systems.

> > which is unspecified is whether a file which is deleted or added after
> > the application has started iterating over the directory will be
> > included or not.  (Think about it; Unix is a multi-user, time-sharing
> > system.  Nothing else makes sense, since otherwise programs that used
> > readdir() would randomly break if a directory is modified by another
> > process at the same time.)
> 
> Only if they are written badly!  Also it depends what you mean by break.  
> For example "while (i = readdir); do rm i; done" would not break, it would 
> simply miss some files.  It would not produce an error.

It would mean that applications would behave unreliable if a directory
is changing while it is calling readdir().  I would call that breaking.

> If it were properly written when it is finished it would check if there 
> are still things to delete and start again if so and keep looping until 
> there are none left.

It might not be deleting files; it might just be searching all of the
files in a directory, so having some (potentially large) number of
files not be searched just because a file happened to be added to a
directory is BAD.  It is a silent error which an application can not
detect.

> Anything else _cannot_ work unless opendir() results in a read_lock on the 
> directory and it is only unlocked on close.  Nothing else is sane and will 
> result in a trivial DOS by any user on the system where a fs has to play 
> tricks.

Not true.  A clever fs can play tricks that will not result in a
denial of service attack.

> > In fact, POSIX requires that telldir() and seekdir() do the right
> > thing even if directory entries are added or deleted between the
> > telldir() and seekdir().  Yes, this is hard on directories which use
> 
> Sorry what do you mean?  They will and can work fine.  You use telldir to 
> give you and offset (f_pos) and seekdir puts the offset into f_pos.  

telldir() and seekdir() do not necessarily have to return offsets.
They simply have to return a cookie which can be used to return to
that particular location.  For example, ext3/htree uses the hash used
the sort-key in the hashed b-tree as the cookie.

You're right.  The 1990 Posix specification does not guarantee that
telldir()/seekdir() exists; unfortunately a sufficiently large number
of applications (including Samba) assume that it exists, so if you
don't implement it correctly, your filesystem will not be usable for
those applications.

> > [JFS uses an]
> > entirely separate b-tree just to guarantee telldir() and seekdir()
> > indexes behave properly in the presence of file inserts and removals.
> 
> So any user can cause DOS/OOM by doing a: "while 1; do opendir(); 
> readdir(); done" on a really big directory (note how I am never closing 
> the directory)...  What a fantastic filesystem that is!  All the sheep are 
> jumping off the bridge, lets jump, too!  I think not...

First of all, the separate b-tree is maintained on disk, as a
permanent part of the filesystem metadata.  When a directory entry is
added to a directory, it creates a unique seekdir index which is added
to the seekdir/telldir b-tree.  This b-tree is used for nothing else,
and readdir() returns directory entries in the seekdir-index order, by
walking the seekdir-index b-tree.  

If a program does "while (1); do opendir(); readdir(); done", the
opendir will eventually return an error when you consume all available
file descriptors.  It is no different from "while (1); do open(); done".

> readdir() is the problem.  It is _impossible_ to do what POSIX demands 
> using readdir without some form of lock to say "directory cannot be 
> modified".  Or if not a lock then a snapshot.  That is exactly what it is 
> asking for!  I guess that would be the only way to support it.  Snapshot 
> the directory and internally queue all modifications or apply them using 
> COW.  But the problem even then is hwo do you know when the user has 
> finished calling readdir().  There is no guarantee they will keep going 
> till EOD is reached.  There is not even any guarantee the user will close 
> the directory they opened.  Again, this would be a DOS and cause OOM in no 
> time on a huge directory.

Not so old Chinese saying...  "Man who says something is impossible
should not interrupt man doing it".

Ext3 with hashed-trees does not use a lock to prevent directory
modifications, and yet can guarantee that each file in the directory
is returned once and only once, even if nodes in the tree need to be
split.  Yes, it requires some cleverness in the implementation, but it
_can_ be done.

JFS also uses a b-tree (or more than one b-tree) to index its
directory, and also gives the same guarantee, but it implements it in
an entirely different way.

Two different filesystems; two different strategies; both implement
this guarantee without needing to lock the directory against
modifications during the readdir() scan.  Take a look at the
implementations closely.

(Hint: both involve walking the b-tree and returning readdir() entries
in tree order.  The details though of which b-tree, how the b-tree is
indexed, how the index is stored in file descriptor offset, and how
telldir/seekdir are implemented are different in the two filesystems,
however.)

						- Ted


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 18:19             ` Theodore Ts'o
  2005-09-15 21:04               ` Anton Altaparmakov
  2005-09-15 21:47               ` Jörn Engel
@ 2005-09-16  7:29               ` Nikita Danilov
  2005-09-16 11:58                 ` Theodore Ts'o
  2 siblings, 1 reply; 32+ messages in thread
From: Nikita Danilov @ 2005-09-16  7:29 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel

Theodore Ts'o writes:
 > On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote:
 > > Correct.  But you're missing half of the problem.
 > > 
 > > The other half is getting O(1)  or O(ln(n)) performance for lookup on
 > > a large directory.  Solution to this half alone would be to use a
 > > balanced tree for directory entries.  Now, using a self-balancing tree
 > > would not solve your half of the problem, so it is not a full
 > > solution.
 > > 
 > > Ext2, as far as I understand it, solved both halves by having both a
 > > hashtree and the sequential list.  Getdents is walking the list and
 > > ignoring entries with the "deleted" flag.  Lookup is using the
 > > hashtree, which points to the list for actual data.
 > 
 > Actually, no.  What we do is return the directory entries in hash sort
 > order, by walking the btree.  And we use the hash as the offset
 > returned in d_off and via telldir().  The reason for this?  So that if
 > we add files which causes a node to split, that we still only return
 > files in the directory once and only once.  If we traversed the tree

Except for hash collided directory entries, that are returned multiple
times, because after seekdir() ext2_readdir() restarts from the start of
hash-bucket, right?

Nikita.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 21:04               ` Anton Altaparmakov
@ 2005-09-16  7:50                 ` Nikita Danilov
  0 siblings, 0 replies; 32+ messages in thread
From: Nikita Danilov @ 2005-09-16  7:50 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: J?rn Engel, Jan Blunck, miklos, aaranya, linux-fsdevel

Anton Altaparmakov writes:

[...]

 > 
 > I don't know how your hash works but on ntfs this does not work like that.  
 > There is no parameter which would be able to tell you where you are in the 
 > b tree, unless you take a reference on an entry so it cannot be removed 
 > (rename==remove btw) but how does the reference get released later?

Reiser4 has exactly the same problem, and this is how it was solved (not
for the weak of heart):

 - offset of a directory entry is its ordinal number within a
 directory. I.e., first name in the directory has offset 0, second---1,
 etc.;

 - struct file for directory remembers "current readdir position" which
 is a pair (ordinal-number-of-entry, key-of-entry-in-the--tree);

 - on seekdir(dir, pos) (== lseek(dir, pos)), pos is the ordinal number
 of the target entry, and comparing it with the current readdir position
 we know how many entries to rewind to the left or to the right to reach
 the target. This can be done in any reasonable *-tree (for suitable
 values of *);

 - now the fun (or horror) part: above rewinding does not work correctly
 when entries were added or removed between current position and the
 target entry. To work about this, all directory modifications are
 monitored and current position is adjusted accordingly: when new
 directory entry is created, its key is compared against key in the
 current readdir position. If entry's key is smaller, then entry is
 added "before" current position and current position's
 ordinal-number-of-entry is incremented. Similarly for entry removal;

 - this of course is still fragile and can be broken, but that's enough
 for most cases (rm -r and friends work, for example);

 - additional complexities due to non-unique keys in the tree and NFS
 (where we don't have persistent struct file for a directory) are left
 as an exercise for a reader.

Nikita.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-15 18:08     ` Nikita Danilov
@ 2005-09-16 11:23       ` Miklos Szeredi
  0 siblings, 0 replies; 32+ messages in thread
From: Miklos Szeredi @ 2005-09-16 11:23 UTC (permalink / raw)
  To: nikita; +Cc: aia21, linux-fsdevel

>  > 
>  > Bonnie++'s code is just complete crap...  It is the author's fault that
>  > it will not work on filesystems where the directory entries are not in
>  > fixed locations...
> 
> Unfortunately Single UNIX Specification states that "offset"
> (telldir(3)'s result) is only invalidated by calls to rewinddir(3).

And the fact the rewinddir() invalidates the offsets returned by
telldir() is unfortunately not very useful, since the current kernel
interface doesn't distingish seekdir() to to beginning of the stream
(which _shoud not_ invalidate offsets) and rewiddir() (which _may_
invalidate offsets).

IOW offsets returned by telldir() will have to be valid even after
rewiddir().

Miklos

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-16  3:39         ` Theodore Ts'o
@ 2005-09-16 11:57           ` Dave Kleikamp
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Kleikamp @ 2005-09-16 11:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Anton Altaparmakov, Akshat Aranya, linux-fsdevel

On Thu, 2005-09-15 at 23:39 -0400, Theodore Ts'o wrote:
> JFS also uses a b-tree (or more than one b-tree) to index its
> directory, and also gives the same guarantee, but it implements it in
> an entirely different way.

Here are the detail of jfs's implementation.  Each directory entry is
assigned sequential "index" when it is created, which is stored in the
directory entry in the dtree (the primary directory data structure).  In
addition, an entry is created in the "index table", which is an array
keyed on the index.  Initially, this table can be stored directly in the
inode, but as it grows, the table is stored in other blocks, and is
addressed by a b-tree anchored in the inode.  For an existing entry, the
table stores the address of the dtree page, and the offset within the
page.  This table is updated whenever a change to the directory moves
the entry.  When an entry is deleted, the table contains the index of
the next existing entry in dtree-order.

Resuming a search means getting the table element for the index,
checking a flag whether it exists or has been deleted, following the
chain of "next" indices until an existing one is found, and getting the
dtree page address and offset of the entry.

The disadvantages are the need to update the directory table as entries
are moved, and the fact that the index table does not shrink as entries
are removed.  (The table is completely truncated if all entries are
removed from a directory, but that isn't typically going to happen.)

> Two different filesystems; two different strategies; both implement
> this guarantee without needing to lock the directory against
> modifications during the readdir() scan.  Take a look at the
> implementations closely.
> 
> (Hint: both involve walking the b-tree and returning readdir() entries
> in tree order.  The details though of which b-tree, how the b-tree is
> indexed, how the index is stored in file descriptor offset, and how
> telldir/seekdir are implemented are different in the two filesystems,
> however.)

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Expected getdents behaviour
  2005-09-16  7:29               ` Nikita Danilov
@ 2005-09-16 11:58                 ` Theodore Ts'o
  0 siblings, 0 replies; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-16 11:58 UTC (permalink / raw)
  To: Nikita Danilov
  Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel

On Fri, Sep 16, 2005 at 11:29:29AM +0400, Nikita Danilov wrote:
>  > Actually, no.  What we do is return the directory entries in hash sort
>  > order, by walking the btree.  And we use the hash as the offset
>  > returned in d_off and via telldir().  The reason for this?  So that if
>  > we add files which causes a node to split, that we still only return
>  > files in the directory once and only once.  If we traversed the tree
> 
> Except for hash collided directory entries, that are returned multiple
> times, because after seekdir() ext2_readdir() restarts from the start of
> hash-bucket, right?

Yes, it's not perfect.  But if you don't use telldir()/seekdir(),
readdir() will return files in a directory once and only once.   

Actually, there's another exception, but that only happens when you
have a hash collision, readdir() is in the middle of returning
multiple directory entries all with the same hash, and at that moment
the node containing the hash collisions gets split.  This is quite
rare, and pretty much impossible to trigger deliberately --- because
the hash algorithm uses a filesystem specific secret to prevent
attackers from deliberately creating files that could cause a hash
collision, and potentially cause applications to misbehave.

If you do use telldir/seekdir, and f_pos is pointing at a multiple
directories with the same hash, then yes, you could get some repeats.
But that's the best we can do given the horrific POSIX interface.  You
can solve the problem 100% perfectly if you adjust the filesystem
format so there is a separate b-tree maintained solely for the purpose
of keeping the telldir/seekdir indexes unique, and so you can traverse
the tree in telldir index order.  But of course the downside of that
any operation that modifies the directory has extra overhead because
there is an extra b-tree on disk that has to be kept up to date.

If you have 64-bit telldir() cookies, and can count on being able to
use 64-bit cookies for NFS, this also makes life easier (but still not
100% perfect); but there are a lot of 32-bit only systems still left
in the world.

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-09-16 11:58 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-15 13:57 Expected getdents behaviour Akshat Aranya
2005-09-15 14:03 ` Peter Staubach
2005-09-15 14:07 ` Anton Altaparmakov
2005-09-15 14:12   ` Anton Altaparmakov
2005-09-15 14:45     ` Miklos Szeredi
2005-09-15 15:17       ` Anton Altaparmakov
2005-09-15 16:41         ` Jan Blunck
2005-09-15 17:46           ` Jörn Engel
2005-09-15 18:19             ` Theodore Ts'o
2005-09-15 21:04               ` Anton Altaparmakov
2005-09-16  7:50                 ` Nikita Danilov
2005-09-15 21:47               ` Jörn Engel
2005-09-16  7:29               ` Nikita Danilov
2005-09-16 11:58                 ` Theodore Ts'o
2005-09-15 21:00             ` Anton Altaparmakov
2005-09-15 21:15               ` Charles P. Wright
2005-09-15 21:19                 ` Anton Altaparmakov
2005-09-15 20:28           ` Anton Altaparmakov
2005-09-15 16:51         ` Miklos Szeredi
2005-09-15 21:17           ` Anton Altaparmakov
2005-09-15 15:51     ` Theodore Ts'o
2005-09-15 16:52       ` Bryan Henderson
2005-09-15 16:57         ` Jeremy Allison
2005-09-15 20:51           ` Anton Altaparmakov
2005-09-15 20:50         ` Anton Altaparmakov
2005-09-15 23:41           ` Bryan Henderson
2005-09-15 20:25       ` Anton Altaparmakov
2005-09-16  3:39         ` Theodore Ts'o
2005-09-16 11:57           ` Dave Kleikamp
2005-09-15 18:08     ` Nikita Danilov
2005-09-16 11:23       ` Miklos Szeredi
2005-09-16  1:28   ` tridge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).