From mboxrd@z Thu Jan  1 00:00:00 1970
From: Theodore Ts'o <tytso@mit.edu>
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 23:39:48 -0400
Message-ID: <20050916033945.GA11047@thunk.org>
References: <e483447805091506573daebc21@mail.gmail.com> <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <20050915155108.GE22503@thunk.org> <Pine.LNX.4.60.0509152028390.26539@hermes-1.csi.cam.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Akshat Aranya <aaranya@cs.sunysb.edu>,
	linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from THUNK.ORG ([69.25.196.29]:42133 "EHLO thunker.thunk.org")
	by vger.kernel.org with ESMTP id S1030582AbVIPDj7 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Thu, 15 Sep 2005 23:39:59 -0400
To: Anton Altaparmakov <aia21@cam.ac.uk>
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.60.0509152028390.26539@hermes-1.csi.cam.ac.uk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Sep 15, 2005 at 09:25:55PM +0100, Anton Altaparmakov wrote:
> > POSIX (or SUSv3) does not guarantee the offset data structure to be
> > the dirent structure at all.  So a portable application should not
> > count on d_off being present.
> 
> Why should f_pos be the dirent structure?!?  That would be completely 
> insane...

The only fields that are guaranteed to be in the dirent structure by
the POSIX specification are:

	ino_t	d_ino;
	char	d_name[];

No other fields must be provided by a POSIX-compliant implementation.
Hence, a portable application must not use d_off.  It may not be
present on all POSIX-compliant operating systems.
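
As an illustration of the point above, here is a minimal sketch of a
directory scan that touches only d_ino and d_name.  The helper name
list_dir_portably() is made up for this example; it is not part of any
standard or library.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: list a directory using only the fields that
 * POSIX guarantees.  Returns the number of entries seen, or -1 on error.
 */
long list_dir_portably(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	long n = 0;
	int saved_errno;

	if (dir == NULL)
		return -1;

	for (;;) {
		errno = 0;	/* NULL means end-of-directory or error */
		de = readdir(dir);
		if (de == NULL)
			break;
		/* d_ino and d_name only; no d_off, d_type or d_reclen */
		printf("%lu\t%s\n", (unsigned long)de->d_ino, de->d_name);
		n++;
	}
	saved_errno = errno;	/* closedir() may overwrite errno */
	closedir(dir);
	return saved_errno ? -1 : n;
}
```

Resetting errno before each readdir() call is what lets the caller tell
end-of-directory apart from a real error, since both return NULL.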

> > which is unspecified is whether a file which is deleted or added after
> > the application has started iterating over the directory will be
> > included or not.  (Think about it; Unix is a multi-user, time-sharing
> > system.  Nothing else makes sense, since otherwise programs that used
> > readdir() would randomly break if a directory is modified by another
> > process at the same time.)
> 
> Only if they are written badly!  Also it depends what you mean by break.  
> For example "while (i = readdir); do rm i; done" would not break, it would 
> simply miss some files.  It would not produce an error.

It would mean that applications would behave unreliably if a directory
is changing while they are calling readdir().  I would call that
breaking.

> If it were properly written when it is finished it would check if there 
> are still things to delete and start again if so and keep looping until 
> there are none left.

It might not be deleting files; it might just be searching all of the
files in a directory, so having some (potentially large) number of
files not be searched just because a file happened to be added to a
directory is BAD.  It is a silent error which an application cannot
detect.

> Anything else _cannot_ work unless opendir() results in a read_lock on the 
> directory and it is only unlocked on close.  Nothing else is sane and will 
> result in a trivial DOS by any user on the system where a fs has to play 
> tricks.

Not true.  A clever fs can play tricks that will not result in a
denial of service attack.

> > In fact, POSIX requires that telldir() and seekdir() do the right
> > thing even if directory entries are added or deleted between the
> > telldir() and seekdir().  Yes, this is hard on directories which use
> 
> Sorry what do you mean?  They will and can work fine.  You use telldir to 
> give you an offset (f_pos) and seekdir puts the offset into f_pos.

telldir() and seekdir() do not necessarily have to deal in offsets.
telldir() simply has to return a cookie which seekdir() can use to
return to that particular location.  For example, ext3/htree uses the
hash that serves as the sort key in the hashed b-tree as the cookie.
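
From the application side, treating the value as a cookie means never
doing arithmetic on it and only ever handing it back to seekdir() on
the same directory stream.  A minimal sketch, assuming an otherwise
idle directory; cookie_roundtrip_ok() is a made-up name for this
example:

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* telldir()/seekdir() are XSI interfaces */
#endif
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical check: returns 1 if seekdir() brings us back to the
 * entry that followed the telldir() call, 0 if not, -1 on error.
 */
int cookie_roundtrip_ok(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	char expected[256];
	long cookie;
	int ok;

	if (dir == NULL)
		return -1;

	(void)readdir(dir);		/* step past the first entry */
	cookie = telldir(dir);		/* opaque; do no arithmetic on it */

	de = readdir(dir);
	if (de == NULL) {		/* nothing after the cookie to compare */
		closedir(dir);
		return 1;
	}
	snprintf(expected, sizeof(expected), "%s", de->d_name);

	while (readdir(dir) != NULL)	/* run to the end of the directory */
		;

	seekdir(dir, cookie);		/* hand the cookie back unchanged */
	de = readdir(dir);
	ok = (de != NULL && strcmp(de->d_name, expected) == 0);

	closedir(dir);
	return ok;
}
```

Nothing in this code cares whether the cookie is a byte offset, a hash
value, or an index into a separate b-tree.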

You're right.  The 1990 POSIX specification does not guarantee that
telldir()/seekdir() exist; unfortunately a sufficiently large number
of applications (including Samba) assume that they exist, so if you
don't implement them correctly, your filesystem will not be usable for
those applications.

> > [JFS uses an]
> > entirely separate b-tree just to guarantee telldir() and seekdir()
> > indexes behave properly in the presence of file inserts and removals.
> 
> So any user can cause DOS/OOM by doing a: "while 1; do opendir(); 
> readdir(); done" on a really big directory (note how I am never closing 
> the directory)...  What a fantastic filesystem that is!  All the sheep are 
> jumping off the bridge, lets jump, too!  I think not...

First of all, the separate b-tree is maintained on disk, as a
permanent part of the filesystem metadata.  Adding a directory entry
creates a unique seekdir index, which is added to the seekdir/telldir
b-tree.  This b-tree is used for nothing else, and readdir() returns
directory entries in the seekdir-index order, by walking the
seekdir-index b-tree.

If a program does "while (1); do opendir(); readdir(); done", the
opendir will eventually return an error when you consume all available
file descriptors.  It is no different from "while (1); do open(); done".
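
A minimal sketch of that failure mode, assuming a system where
RLIMIT_NOFILE can be lowered by the calling process.  The helper name
opendir_until_failure() and the MAX_TRACKED cap are made up for this
example:

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>

#define MAX_TRACKED 1024

/*
 * Hypothetical demo: lower the soft descriptor limit to soft_limit,
 * then call opendir() without closedir() until it fails.  Stores the
 * number of successful opens in *opened and returns the errno of the
 * failure (0 if MAX_TRACKED opens all succeeded, -1 if the limit could
 * not be changed).
 */
int opendir_until_failure(const char *path, rlim_t soft_limit, int *opened)
{
	DIR *dirs[MAX_TRACKED];
	struct rlimit old, lim;
	int n = 0, err = 0;

	if (getrlimit(RLIMIT_NOFILE, &old) != 0)
		return -1;
	lim = old;
	lim.rlim_cur = soft_limit;
	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
		return -1;

	while (n < MAX_TRACKED) {
		dirs[n] = opendir(path);
		if (dirs[n] == NULL) {
			err = errno;	/* normally EMFILE at this point */
			break;
		}
		n++;
	}

	*opened = n;
	while (n > 0)			/* clean up what we leaked */
		closedir(dirs[--n]);
	setrlimit(RLIMIT_NOFILE, &old);	/* restore the original limit */
	return err;
}
```

The process runs into its own per-process descriptor limit long before
the filesystem is in any trouble.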

> readdir() is the problem.  It is _impossible_ to do what POSIX demands 
> using readdir without some form of lock to say "directory cannot be 
> modified".  Or if not a lock then a snapshot.  That is exactly what it is 
> asking for!  I guess that would be the only way to support it.  Snapshot 
> the directory and internally queue all modifications or apply them using 
> COW.  But the problem even then is how do you know when the user has 
> finished calling readdir().  There is no guarantee they will keep going 
> till EOD is reached.  There is not even any guarantee the user will close 
> the directory they opened.  Again, this would be a DOS and cause OOM in no 
> time on a huge directory.

Not so old Chinese saying...  "Man who says something is impossible
should not interrupt man doing it".

Ext3 with hashed-trees does not use a lock to prevent directory
modifications, and yet can guarantee that each file in the directory
is returned once and only once, even if nodes in the tree need to be
split.  Yes, it requires some cleverness in the implementation, but it
_can_ be done.

JFS also uses a b-tree (or more than one b-tree) to index its
directory, and also gives the same guarantee, but it implements it in
an entirely different way.

Two different filesystems; two different strategies; both implement
this guarantee without needing to lock the directory against
modifications during the readdir() scan.  Take a look at the
implementations closely.

(Hint: both involve walking the b-tree and returning readdir() entries
in tree order.  The details of which b-tree is walked, how the b-tree
is indexed, how the index is stored in the file descriptor's offset,
and how telldir/seekdir are implemented differ between the two
filesystems, however.)
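
To make the hint concrete, here is a toy model of the general idea,
and only that: it is not the ext3 or JFS code.  It assumes unique
32-bit keys held in a sorted in-memory array, where a real filesystem
has an on-disk b-tree and has to cope with hash collisions and node
splits.  All of the toy_* names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX 64

/* Toy directory: unique keys kept in ascending order. */
struct toy_dir {
	uint32_t key[TOY_MAX];
	size_t count;
};

/* Index of the first key >= k (count if there is none). */
static size_t toy_lower_bound(const struct toy_dir *d, uint64_t k)
{
	size_t i = 0;

	while (i < d->count && d->key[i] < k)
		i++;
	return i;
}

int toy_insert(struct toy_dir *d, uint32_t k)
{
	size_t i = toy_lower_bound(d, k), j;

	if (d->count == TOY_MAX || (i < d->count && d->key[i] == k))
		return -1;	/* full, or duplicate key */
	for (j = d->count; j > i; j--)
		d->key[j] = d->key[j - 1];
	d->key[i] = k;
	d->count++;
	return 0;
}

int toy_remove(struct toy_dir *d, uint32_t k)
{
	size_t i = toy_lower_bound(d, k);

	if (i == d->count || d->key[i] != k)
		return -1;	/* no such key */
	for (; i + 1 < d->count; i++)
		d->key[i] = d->key[i + 1];
	d->count--;
	return 0;
}

/*
 * Return the next key at or after *cursor and advance the cursor past
 * it.  Returns 1 on success, 0 at end of directory.  The cursor is a
 * key value, not an array index, so inserts and removals elsewhere in
 * the array do not disturb it.
 */
int toy_readdir(const struct toy_dir *d, uint64_t *cursor, uint32_t *out)
{
	size_t i = toy_lower_bound(d, *cursor);

	if (i == d->count)
		return 0;
	*out = d->key[i];
	*cursor = (uint64_t)d->key[i] + 1;
	return 1;
}
```

Start with keys 10, 20, 30, 40 and a cursor of 0, read one entry, then
insert 5 and 35 and remove 10.  The remaining reads return 20, 30, 35
and 40.  Every entry that existed for the whole scan comes back exactly
once; the entry added behind the cursor is skipped and the one added
ahead of it is seen, which is the part that is allowed to go either
way.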

						- Ted