From mboxrd@z Thu Jan 1 00:00:00 1970
From: Theodore Ts'o
Subject: Re: Expected getdents behaviour
Date: Thu, 15 Sep 2005 23:39:48 -0400
Message-ID: <20050916033945.GA11047@thunk.org>
References: <1126793268.1676.9.camel@imp.csi.cam.ac.uk> <1126793558.1676.15.camel@imp.csi.cam.ac.uk> <20050915155108.GE22503@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Akshat Aranya, linux-fsdevel@vger.kernel.org
Return-path:
Received: from THUNK.ORG ([69.25.196.29]:42133 "EHLO thunker.thunk.org") by vger.kernel.org with ESMTP id S1030582AbVIPDj7 (ORCPT ); Thu, 15 Sep 2005 23:39:59 -0400
To: Anton Altaparmakov
Content-Disposition: inline
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Thu, Sep 15, 2005 at 09:25:55PM +0100, Anton Altaparmakov wrote:
> > POSIX (or SUSv3) does not guarantee the offset data structure to be
> > the dirent structure at all. So a portable application should not
> > count on d_off being present.
>
> Why should f_pos be the dirent structure?!? That would be completely
> insane...

The only fields that are guaranteed to be in the dirent structure by
the POSIX specification are:

    ino_t d_ino;
    char  d_name[];

A POSIX-compliant implementation is not required to provide any other
fields. Hence, a portable application must not use d_off. It may not
be present on all POSIX-compliant operating systems.

> > which is unspecified is whether a file which is deleted or added after
> > the application has started iterating over the directory will be
> > included or not. (Think about it; Unix is a multi-user, time-sharing
> > system. Nothing else makes sense, since otherwise programs that used
> > readdir() would randomly break if a directory is modified by another
> > process at the same time.)
>
> Only if they are written badly! Also it depends what you mean by break.
> For example "while (i = readdir); do rm i; done" would not break, it would
> simply miss some files.
> It would not produce an error.

It would mean that applications would behave unreliably if a directory
is changing while they are calling readdir(). I would call that
breaking.

> If it were properly written when it is finished it would check if there
> are still things to delete and start again if so and keep looping until
> there are none left.

It might not be deleting files; it might just be searching all of the
files in a directory, so having some (potentially large) number of
files not be searched just because a file happened to be added to a
directory is BAD. It is a silent error which an application cannot
detect.

> Anything else _cannot_ work unless opendir() results in a read_lock on the
> directory and it is only unlocked on close. Nothing else is sane and will
> result in a trivial DOS by any user on the system where a fs has to play
> tricks.

Not true. A clever fs can play tricks that will not result in a
denial of service attack.

> > In fact, POSIX requires that telldir() and seekdir() do the right
> > thing even if directory entries are added or deleted between the
> > telldir() and seekdir(). Yes, this is hard on directories which use
>
> Sorry what do you mean? They will and can work fine. You use telldir to
> give you an offset (f_pos) and seekdir puts the offset into f_pos.

telldir() and seekdir() do not necessarily have to work in terms of
offsets. telldir() simply has to return a cookie which seekdir() can
use to return to that particular location. For example, ext3/htree
uses the hash that serves as the sort key in the hashed b-tree as the
cookie.

You're right. The 1990 POSIX specification does not guarantee that
telldir()/seekdir() exist; unfortunately a sufficiently large number
of applications (including Samba) assume that they exist, so if you
don't implement them correctly, your filesystem will not be usable for
those applications.
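
To illustrate the offset-versus-cookie point, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not ext3 or JFS code: the `ToyDir` class, its method names, and the use of CRC32 as the sort key are all hypothetical. The directory is modelled as a list sorted by hash, and the demo compares resuming from a saved array index with resuming from a saved sort key after an already-read entry is deleted.

```python
import bisect
import zlib


def name_hash(name):
    # Stand-in sort key (assumption): CRC32 of the name. Real filesystems
    # use their own hash functions and must also handle collisions.
    return zlib.crc32(name.encode())


class ToyDir:
    """Toy directory: entries kept sorted by (hash, name)."""

    def __init__(self, names):
        self.entries = sorted((name_hash(n), n) for n in names)

    def unlink(self, name):
        self.entries.remove((name_hash(name), name))

    def read_from_index(self, index):
        # Offset-style resume: "entry number N in the current layout".
        return [n for _, n in self.entries[index:]]

    def read_from_key(self, key):
        # Cookie-style resume: first entry whose sort key is >= the cookie.
        start = bisect.bisect_left(self.entries, key)
        return [n for _, n in self.entries[start:]]


def demo():
    d = ToyDir(["a", "b", "c", "d", "e"])
    order = [n for _, n in d.entries]
    # The reader has consumed two entries and saves its position both ways.
    index_cookie = 2
    key_cookie = d.entries[2]
    # Another process deletes an entry the reader has already seen.
    d.unlink(order[0])
    by_index = d.read_from_index(index_cookie)
    by_key = d.read_from_key(key_cookie)
    return order, by_index, by_key
```

In this model the index-based resume silently skips one entry, because the deletion shifted everything down by one slot, while the key-based resume picks up exactly where the reader left off.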
> > [JFS uses an]
> > entirely separate b-tree just to guarantee telldir() and seekdir()
> > indexes behave properly in the presence of file inserts and removals.
>
> So any user can cause DOS/OOM by doing a: "while 1; do opendir();
> readdir(); done" on a really big directory (note how I am never closing
> the directory)... What a fantastic filesystem that is! All the sheep are
> jumping off the bridge, lets jump, too! I think not...

First of all, the separate b-tree is maintained on disk, as a
permanent part of the filesystem metadata. When a directory entry is
added to a directory, it is assigned a unique seekdir index, which is
added to the seekdir/telldir b-tree. This b-tree is used for nothing
else, and readdir() returns directory entries in seekdir-index order,
by walking the seekdir-index b-tree.

If a program does "while (1); do opendir(); readdir(); done", the
opendir will eventually return an error when you consume all available
file descriptors. It is no different from "while (1); do open(); done".

> readdir() is the problem. It is _impossible_ to do what POSIX demands
> using readdir without some form of lock to say "directory cannot be
> modified". Or if not a lock then a snapshot. That is exactly what it is
> asking for! I guess that would be the only way to support it. Snapshot
> the directory and internally queue all modifications or apply them using
> COW. But the problem even then is how do you know when the user has
> finished calling readdir(). There is no guarantee they will keep going
> till EOD is reached. There is not even any guarantee the user will close
> the directory they opened. Again, this would be a DOS and cause OOM in no
> time on a huge directory.

Not so old Chinese saying... "Man who says something is impossible
should not interrupt man doing it".
Ext3 with hashed trees does not use a lock to prevent directory
modifications, and yet can guarantee that each file in the directory
is returned once and only once, even if nodes in the tree need to be
split. Yes, it requires some cleverness in the implementation, but it
_can_ be done.

JFS also uses a b-tree (or more than one b-tree) to index its
directory, and also gives the same guarantee, but it implements it in
an entirely different way.

Two different filesystems; two different strategies; both implement
this guarantee without needing to lock the directory against
modifications during the readdir() scan. Take a look at the
implementations closely. (Hint: both involve walking the b-tree and
returning readdir() entries in tree order. The details of which
b-tree, how the b-tree is indexed, how the index is stored in the file
descriptor offset, and how telldir/seekdir are implemented differ
between the two filesystems, however.)

                                                - Ted
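
The hint above (walk the entries in tree order and keep the position as a sort key) can be sketched with a toy model in Python. This is an illustration under stated assumptions, not the ext3 or JFS implementation: the function names, the sorted list standing in for a b-tree, and CRC32 as the sort key are all hypothetical, and hash collisions and node splits are ignored. The scan reads the directory in small batches while entries are added and removed between batches, without taking any lock.

```python
import bisect
import zlib


def name_hash(name):
    # Stand-in sort key (assumption); not the hash any real filesystem uses.
    return zlib.crc32(name.encode())


def readdir_batch(entries, cookie, count):
    """Return up to `count` names with sort key > cookie, plus the new cookie.

    `entries` is a sorted list of (hash, name) tuples; `cookie` is the key of
    the last entry returned, or None at the start of the scan.
    """
    start = 0 if cookie is None else bisect.bisect_right(entries, cookie)
    batch = entries[start:start + count]
    if batch:
        cookie = batch[-1]
    return [name for _, name in batch], cookie


def scan_with_churn(initial, to_add, to_remove, batch_size=2):
    """Scan the whole directory while another actor adds and removes names."""
    entries = sorted((name_hash(n), n) for n in initial)
    to_add, to_remove = list(to_add), list(to_remove)
    seen, cookie = [], None
    while True:
        names, cookie = readdir_batch(entries, cookie, batch_size)
        if not names:
            break
        seen.extend(names)
        # Between batches, apply one pending insertion and one removal.
        if to_add:
            name = to_add.pop(0)
            bisect.insort(entries, (name_hash(name), name))
        if to_remove:
            name = to_remove.pop(0)
            entries.remove((name_hash(name), name))
    return seen
```

In this model, a name that exists for the whole scan is returned exactly once, and no name is ever returned twice; names added or removed mid-scan may or may not show up, which is the part POSIX leaves unspecified.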