From: Anton Altaparmakov <aia21@cam.ac.uk>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Akshat Aranya <aaranya@cs.sunysb.edu>, linux-fsdevel@vger.kernel.org
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 21:25:55 +0100 (BST)
Message-ID: <Pine.LNX.4.60.0509152028390.26539@hermes-1.csi.cam.ac.uk>
In-Reply-To: <20050915155108.GE22503@thunk.org>
On Thu, 15 Sep 2005, Theodore Ts'o wrote:
> On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote:
> > Oops. I forgot to answer your question. Yes, the filesystem needs to
> > consider the offset value in the second readdir to still be valid. You
> > cannot keep rewinding back to zero every time you make a modification or
> > you would keep returning entries you have already returned and never
> > make any progress if e.g. some user does this in a loop at the same
> > time:
>
> POSIX (or SUSv3) does not guarantee the offset data structure to be
> the dirent structure at all. So a portable application should not
> count on d_off being present.
Why should f_pos be the dirent structure?!? That would be completely
insane...
> That being said, it *is* fair game to assume that an application
> should be able to call readdir() repeatedly and get all files in the
> directory once and exactly once, even if another process is unlinking
> files or adding files while the readdir is going on. The only thing
I disagree. readdir() is a completely brain damaged interface and it is
not fair game to assume that at all...
> which is unspecified is whether a file which is deleted or added after
> the application has started iterating over the directory will be
> included or not. (Think about it; Unix is a multi-user, time-sharing
> system. Nothing else makes sense, since otherwise programs that used
> readdir() would randomly break if a directory is modified by another
> process at the same time.)
Only if they are written badly! Also it depends what you mean by break.
For example "while (i = readdir); do rm i; done" would not break, it would
simply miss some files. It would not produce an error.
If it were properly written when it is finished it would check if there
are still things to delete and start again if so and keep looping until
there are none left.
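
A minimal C sketch of such a loop, assuming only non-directory entries
need removing. The name remove_all_files() is made up for illustration;
opendir(), readdir(), rewinddir(), dirfd() and unlinkat() are the standard
POSIX calls.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any library): unlink every non-directory
 * entry in `path`. A single readdir() pass may miss entries when the
 * directory changes underneath it, so rescan from the start until a
 * complete pass finds nothing left to delete.
 * Returns 0 on success, -1 on error with errno set.
 */
int remove_all_files(const char *path)
{
    assert(path != NULL);

    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    int found;
    do {
        found = 0;
        rewinddir(d);                 /* restart the scan at the beginning */
        for (;;) {
            errno = 0;                /* lets us tell end-of-dir from error */
            struct dirent *e = readdir(d);
            if (e == NULL)
                break;
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            found++;
            /* ENOENT just means someone else removed it first. */
            if (unlinkat(dirfd(d), e->d_name, 0) != 0 && errno != ENOENT)
                goto fail;
        }
        if (errno != 0)               /* readdir() itself failed */
            goto fail;
    } while (found > 0);

    return closedir(d);

fail:
    {
        int saved = errno;            /* closedir() may clobber errno */
        closedir(d);
        errno = saved;
    }
    return -1;
}
```

Note that this sketch only terminates once a whole pass comes back empty,
so it keeps going for as long as another process keeps creating files.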
Anything else _cannot_ work unless opendir() results in a read_lock on the
directory and it is only unlocked on close. Nothing else is sane, and
anything else will result in a trivial DOS by any user on a system where a
fs has to play tricks.
> In fact, POSIX requires that telldir() and seekdir() do the right
> thing even if directory entries are added or deleted between the
> telldir() and seekdir(). Yes, this is hard on directories which use
Sorry what do you mean? They will and can work fine. You use telldir to
give you an offset (f_pos) and seekdir puts the offset into f_pos.
Nothing more nothing less. If you have removed files or added files in
between two readdir calls (irrelevant whether you used seek/telldir) the
f_pos will just now point in the wrong place and you will get some entries
duplicated or you will miss some because you did not rewind back to 0
after the change and because the directory was not locked against
modifications.
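
To make the save/restore behaviour concrete, here is a small C sketch. The
function name same_entry_after_seek() is invented for this example. It
assumes nobody modifies the directory while it runs; with concurrent
changes the saved position may no longer line up with the same entry,
which is the case described above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demo: remember a position with telldir(), read on to the
 * end, jump back with seekdir(), and check whether readdir() yields the
 * same entry as it did the first time.
 * Returns 1 if it does, 0 if it does not, -1 if the directory cannot be
 * opened.
 */
int same_entry_after_seek(const char *path)
{
    assert(path != NULL);

    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    (void)readdir(d);                  /* consume one entry */
    long pos = telldir(d);             /* opaque position value */

    char first[256] = "";              /* name seen right after `pos` */
    struct dirent *e = readdir(d);
    int had_entry = (e != NULL);
    if (had_entry)
        snprintf(first, sizeof first, "%s", e->d_name);

    while (readdir(d) != NULL)         /* run to the end of the directory */
        ;

    seekdir(d, pos);                   /* put the saved position back */
    e = readdir(d);

    int same = had_entry ? (e != NULL && strcmp(first, e->d_name) == 0)
                         : (e == NULL);
    closedir(d);
    return same;
}
```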
> something more sophisticated than a simple linked list to store their
> directory entries (like a b-tree, for example). However, it is
Yes, ntfs uses a B tree.
> required by POSIX/SUSv3. The JFS filesystem, for example, uses an
Er, have you read it? To quote from "IEEE Std 1003.1, 2004 Edition",
seekdir, from the informative "rationale" section:
<quote>
The original standard developers perceived that there were restrictions on
the use of the seekdir() and telldir() functions related to implementation
details, and for that reason these functions need not be supported on all
POSIX-conforming systems. They are required on implementations supporting
the XSI extension.
One of the perceived problems of implementation is that returning to a
given point in a directory is quite difficult to describe formally, in
spite of its intuitive appeal, when systems that use B-trees, hashing
functions, or other similar mechanisms to order their directories are
considered. The definition of seekdir() and telldir() does not specify
whether, when using these interfaces, a given directory entry will be seen
at all, or more than once.
On systems not supporting these functions, their capability can sometimes
be accomplished by saving a filename found by readdir() and later using
rewinddir() and a loop on readdir() to relocate the position from which
the filename was saved.
</quote>
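
The fallback from the last quoted paragraph (save a filename, rewinddir(),
then loop on readdir()) might look like this in C. The helper name
reposition_after() is hypothetical, and the sketch assumes the saved
entry still exists; if it was removed in the meantime the scan simply
reaches the end and reports failure.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: reposition `d` just past the entry called `name`
 * without telldir()/seekdir(), by rewinding and scanning forward.
 * Returns 0 if the entry was found (the next readdir() continues after
 * it), or -1 if the scan hit the end first, e.g. because the entry was
 * removed in the meantime.
 */
int reposition_after(DIR *d, const char *name)
{
    assert(d != NULL && name != NULL);

    rewinddir(d);
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, name) == 0)
            return 0;
    }
    return -1;
}
```

The cost is a linear rescan per reposition, and on a filesystem that
reorders entries the entries following `name` need not be the ones that
followed it before.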
Thus any application relying on f_pos in a directory to be meaningful is
broken by design and even POSIX says so. Heck seek/telldir is not even
required in POSIX unless you implement the XSI extension (I admit I have
no idea what XSI is)! (It says so above...)
> entirely separate b-tree just to guarantee telldir() and seekdir()
> indexes behave properly in the presence of file inserts and removals.
So any user can cause DOS/OOM by doing a: "while 1; do opendir();
readdir(); done" on a really big directory (note how I am never closing
the directory)... What a fantastic filesystem that is! All the sheep are
jumping off the bridge, let's jump, too! I think not...
> > Bonnie++'s code is just complete crap... It is the author's fault that
> > it will not work on filesystems where the directory entries are not in
> > fixed locations...
>
> If Bonnie++ is relying on d_off, then yes. But in fact, if Bonnie++
> is just doing a series of readdir()'s, and the filesystem doesn't do
> the right thing in the face of concurrent deletes or file creates, it
> is in fact the filesystem which is broken. It doesn't matter if the
I still disagree. The standards are broken if they require that from
readdir(). Obviously at least someone understands given the seekdir()
description.
> filesystem is using a sophisticated b-tree data structure; it still
> has to do the right thing. There is a lot of hair in ext3, jfs, xfs,
> reiserfs, etc. in order to guarantee this to be the case, since it is
> expected by Unix applications, and it is required by the standards
> specifications.
>
> (I often curse the POSIX specifiers for including telldir/seekdir into
> the standards, since it's hell to support, but it's there, and there
> are applications which rely on it --- unfortunately.)
seekdir()/telldir() are no problem as they are meaningless and POSIX
agrees.
readdir() is the problem. It is _impossible_ to do what POSIX demands
using readdir without some form of lock to say "directory cannot be
modified". Or if not a lock then a snapshot. That is exactly what it is
asking for! I guess that would be the only way to support it. Snapshot
the directory and internally queue all modifications or apply them using
COW. But the problem even then is how do you know when the user has
finished calling readdir()? There is no guarantee they will keep going
till EOD is reached. There is not even any guarantee the user will close
the directory they opened. Again, this would be a DOS and cause OOM in no
time on a huge directory.
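
For comparison, an application can take its own user-space copy of the
names and let go of the directory straight away. The C sketch below does
that with scandir(); the names snapshot_dir(), free_snapshot() and
not_dot() are made up for this example. This is only an analogue of the
in-filesystem snapshot discussed above, and it is not atomic: the copy is
still built from ordinary directory reads, so concurrent changes during
the scan may or may not show up in it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Filter for scandir(): keep everything except "." and "..". */
static int not_dot(const struct dirent *e)
{
    return strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0;
}

/*
 * Hypothetical helper: copy all entry names of `path` into memory in one
 * go. scandir() opens, reads and closes the directory itself, so no
 * directory stream stays open afterwards.
 * Returns the number of entries, or -1 on error. The caller releases the
 * result with free_snapshot().
 */
int snapshot_dir(const char *path, struct dirent ***list)
{
    assert(path != NULL && list != NULL);
    return scandir(path, list, not_dot, alphasort);
}

/* Release a list returned by snapshot_dir(). */
void free_snapshot(struct dirent **list, int n)
{
    for (int i = 0; i < n; i++)
        free(list[i]);
    free(list);
}
```

The memory cost moves to the caller, who knows when it is done with the
list, which is exactly the information the filesystem does not have.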
Maybe I am missing something... How would you suggest to work around the
above described problems?
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/