linux-fsdevel.vger.kernel.org archive mirror
From: Anton Altaparmakov <aia21@cam.ac.uk>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Akshat Aranya <aaranya@cs.sunysb.edu>, linux-fsdevel@vger.kernel.org
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 21:25:55 +0100 (BST)
Message-ID: <Pine.LNX.4.60.0509152028390.26539@hermes-1.csi.cam.ac.uk>
In-Reply-To: <20050915155108.GE22503@thunk.org>

On Thu, 15 Sep 2005, Theodore Ts'o wrote:
> On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote:
> > Oops.  I forgot to answer your question.  Yes, the filesystem needs to
> > consider the offset value in the second readdir to still be valid.  You
> > cannot keep rewinding back to zero every time you make a modification or
> > you would keep returning entries you have already returned and never
> > make any progress if e.g. some user does this in a loop at the same
> > time:
> 
> POSIX (or SUSv3) does not guarantee the offset data structure to be
> the dirent structure at all.  So a portable application should not
> count on d_off being present.

Why should f_pos be the dirent structure?!?  That would be completely 
insane...

> That being said, it *is* fair game to assume that an application
> should be able to call readdir() repeatedly and get all files in the
> directory once and exactly once, even if another process is unlinking
> files or adding files while the readdir is going on.  The only thing

I disagree.  readdir() is a completely brain damaged interface and it is 
not fair game to assume that at all...

> which is unspecified is whether a file which is deleted or added after
> the application has started iterating over the directory will be
> included or not.  (Think about it; Unix is a multi-user, time-sharing
> system.  Nothing else makes sense, since otherwise programs that used
> readdir() would randomly break if a directory is modified by another
> process at the same time.)

Only if they are written badly!  Also it depends what you mean by break.  
For example "while (i = readdir); do rm i; done" would not break, it would 
simply miss some files.  It would not produce an error.

If it were properly written when it is finished it would check if there 
are still things to delete and start again if so and keep looping until 
there are none left.
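
A minimal sketch of such a retrying delete loop in C, assuming a flat 
directory that contains only regular files.  The names remove_pass() and 
remove_all_entries() are made up for illustration.  Each pass unlinks 
whatever readdir() happens to return, and passes repeat until one removes 
nothing:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* One pass over the directory, unlinking every entry readdir() returns.
 * Returns the number of entries removed, or -1 on error. */
static int remove_pass(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int removed = 0;
    for (;;) {
        errno = 0;                      /* tell end-of-directory from error */
        struct dirent *ent = readdir(dir);
        if (ent == NULL)
            break;
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        /* Resolve the name relative to the directory we have open. */
        if (unlinkat(dirfd(dir), ent->d_name, 0) == 0) {
            removed++;
        } else if (errno != ENOENT) {   /* ENOENT: someone else got there first */
            closedir(dir);
            return -1;
        }
    }
    int err = errno;                    /* non-zero only if readdir() failed */
    closedir(dir);
    return err ? -1 : removed;
}

/* Keep making passes until a pass finds nothing left to delete.
 * Returns the total number of entries removed, or -1 on error. */
int remove_all_entries(const char *path)
{
    assert(path != NULL);
    int total = 0;
    for (;;) {
        int n = remove_pass(path);
        if (n < 0)
            return -1;
        if (n == 0)
            return total;
        total += n;
    }
}
```

A pass that misses entries because the directory changed underneath it is 
harmless here: the next pass starts again from the beginning and picks 
them up.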

Anything else _cannot_ work unless opendir() results in a read_lock on the 
directory and it is only unlocked on close.  Nothing else is sane, and any 
fs that has to play tricks instead opens up a trivial DoS by any user on 
the system.

> In fact, POSIX requires that telldir() and seekdir() do the right
> thing even if directory entries are added or deleted between the
> telldir() and seekdir().  Yes, this is hard on directories which use

Sorry what do you mean?  They will and can work fine.  You use telldir to 
give you an offset (f_pos) and seekdir puts the offset into f_pos.  
Nothing more, nothing less.  If you have removed files or added files in 
between two readdir calls (irrelevant whether you used seek/telldir) the 
f_pos will now just point to the wrong place and you will get some entries 
duplicated or you will miss some, because you did not rewind back to 0 
after the change and because the directory was not locked against 
modifications.
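
To make the "nothing more, nothing less" concrete, here is a hedged sketch 
in C.  The function name seek_returns_same_entry() is hypothetical.  It 
skips a few entries, saves the position with telldir(), reads one entry, 
seeks back with seekdir() and reads again.  On a directory nobody touches 
in between, the second read should return the same name; if entries are 
added or removed between the two calls, the saved position may land 
somewhere else, as described above.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the entry read after seekdir() has the same name as the
 * one read right after telldir(), 0 if it differs, and -1 on error or
 * if the directory has fewer than skip + 1 entries. */
int seek_returns_same_entry(const char *path, int skip)
{
    assert(path != NULL && skip >= 0);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    for (int i = 0; i < skip; i++) {
        if (readdir(dir) == NULL) {
            closedir(dir);
            return -1;
        }
    }

    long pos = telldir(dir);            /* opaque value, only good for seekdir() */

    struct dirent *ent = readdir(dir);
    if (ent == NULL) {
        closedir(dir);
        return -1;
    }
    char *first = strdup(ent->d_name);  /* ent may be overwritten by the next readdir() */
    if (first == NULL) {
        closedir(dir);
        return -1;
    }

    seekdir(dir, pos);                  /* put the saved value back */
    ent = readdir(dir);
    int same = (ent != NULL && strcmp(ent->d_name, first) == 0);

    free(first);
    closedir(dir);
    return same;
}
```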

> something more sophisticated than a simple linked list to store their
> directory entries (like a b-tree, for example).  However, it is

Yes, ntfs uses a B tree.

> required by POSIX/SUSv3.  The JFS filesystem, for example, uses an

Er, have you read it?  To quote from "IEEE Std 1003.1, 2004 Edition", 
seekdir, from the informative "rationale" section:

<quote>
The original standard developers perceived that there were restrictions on 
the use of the seekdir() and telldir() functions related to implementation 
details, and for that reason these functions need not be supported on all 
POSIX-conforming systems. They are required on implementations supporting 
the XSI extension.

One of the perceived problems of implementation is that returning to a 
given point in a directory is quite difficult to describe formally, in 
spite of its intuitive appeal, when systems that use B-trees, hashing 
functions, or other similar mechanisms to order their directories are 
considered. The definition of seekdir() and telldir() does not specify 
whether, when using these interfaces, a given directory entry will be seen 
at all, or more than once.

On systems not supporting these functions, their capability can sometimes 
be accomplished by saving a filename found by readdir() and later using 
rewinddir() and a loop on readdir() to relocate the position from which 
the filename was saved.
</quote>
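
The fallback described in the last quoted paragraph could look roughly 
like the following C sketch.  The helper name relocate_after() is invented 
for illustration; the caller is assumed to have saved a copy of d_name 
from an earlier readdir().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Rewind the stream and read forward until saved_name is found.
 * On success the stream is positioned just past that entry and 0 is
 * returned.  Returns -1 if the name is no longer in the directory, in
 * which case the stream is left at end-of-directory. */
int relocate_after(DIR *dir, const char *saved_name)
{
    assert(dir != NULL && saved_name != NULL);

    rewinddir(dir);

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, saved_name) == 0)
            return 0;
    }
    return -1;
}
```

Note that this costs a full scan per relocation and still cannot recover 
if the saved name itself has been unlinked in the meantime.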

Thus any application relying on f_pos in a directory to be meaningful is 
broken by design and even POSIX says so.  Heck, seek/telldir is not even 
required in POSIX unless you implement the XSI extension (I admit I have 
no idea what XSI is)!  (It says so above...)

> entirely separate b-tree just to guarantee telldir() and seekdir()
> indexes behave properly in the presence of file inserts and removals.

So any user can cause DoS/OOM by doing a: "while 1; do opendir(); 
readdir(); done" on a really big directory (note how I am never closing 
the directory)...  What a fantastic filesystem that is!  All the sheep are 
jumping off the bridge, let's jump, too!  I think not...

> > Bonnie++'s code is just complete crap...  It is the author's fault that
> > it will not work on filesystems where the directory entries are not in
> > fixed locations...
> 
> If Bonnie++ is relying on d_off, then yes.  But in fact, if Bonnie++
> is just doing a series of readdir()'s, and the filesystem doesn't do
> the right thing in the face of concurrent deletes or file creates, it
> is in fact the filesystem which is broken.  It doesn't matter if the

I still disagree.  The standards are broken if they require that from 
readdir().  Obviously at least someone understands given the seekdir() 
description.

> filesystem is using a sophisticated b-tree data structure; it still
> has to do the right thing.  There is a lot of hair in ext3, jfs, xfs,
> reiserfs, etc. in order to guarantee this to be the case, since it is
> expected by Unix applications, and it is required by the standards
> specifications.
>
> (I often curse the POSIX specifiers for including telldir/seekdir into
> the standards, since it's hell to support, but it's there, and there
> are applications which rely on it --- unfortunately.)

seekdir()/telldir() are no problem as they are meaningless and POSIX 
agrees.

readdir() is the problem.  It is _impossible_ to do what POSIX demands 
using readdir without some form of lock to say "directory cannot be 
modified".  Or if not a lock then a snapshot.  That is exactly what it is 
asking for!  I guess that would be the only way to support it.  Snapshot 
the directory and internally queue all modifications or apply them using 
COW.  But the problem even then is how do you know when the user has 
finished calling readdir()?  There is no guarantee they will keep going 
till EOD is reached.  There is not even any guarantee the user will close 
the directory they opened.  Again, this would be a DoS and cause OOM in no 
time on a huge directory.

Maybe I am missing something...  How would you suggest working around the 
problems described above?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
