* Expected getdents behaviour
@ 2005-09-15 13:57 Akshat Aranya
2005-09-15 14:03 ` Peter Staubach
2005-09-15 14:07 ` Anton Altaparmakov
0 siblings, 2 replies; 32+ messages in thread
From: Akshat Aranya @ 2005-09-15 13:57 UTC (permalink / raw)
To: linux-fsdevel
I noticed that bonnie++, in its directory tests, does the following on
a large directory
open(dir);
while (getdents() != 0)
{
unlink all the returned entries from getdents
}
close(dir);
My question is whether the filesystem's readdir is expected to
consider the offset value in the second readdir to still be valid,
given that entries from the directory were deleted after the first
readdir.
Akshat
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: Expected getdents behaviour 2005-09-15 13:57 Expected getdents behaviour Akshat Aranya @ 2005-09-15 14:03 ` Peter Staubach 2005-09-15 14:07 ` Anton Altaparmakov 1 sibling, 0 replies; 32+ messages in thread From: Peter Staubach @ 2005-09-15 14:03 UTC (permalink / raw) To: Akshat Aranya; +Cc: linux-fsdevel Akshat Aranya wrote: >I noticed that bonnie++, in its directory tests, does the following on >a large directory > >open(dir); >while (getdents() != 0) >{ > unlink all the returned entries from getdents >} >close(dir); > >My question is whether the filesystem's readdir is expected to >consider the offset value in the second readdir to still be valid, >given that entries from the directory were deleted after the first >readdir. > This "usually" works. Most file systems do not directory compaction when removing entries. However, some do, so to be safe, the algorithm needs to take into account this and also entries which can not be removed. Thanx... ps ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 13:57 Expected getdents behaviour Akshat Aranya 2005-09-15 14:03 ` Peter Staubach @ 2005-09-15 14:07 ` Anton Altaparmakov 2005-09-15 14:12 ` Anton Altaparmakov 2005-09-16 1:28 ` tridge 1 sibling, 2 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 14:07 UTC (permalink / raw) To: Akshat Aranya; +Cc: linux-fsdevel On Thu, 2005-09-15 at 09:57 -0400, Akshat Aranya wrote: > I noticed that bonnie++, in its directory tests, does the following on > a large directory > > open(dir); > while (getdents() != 0) > { > unlink all the returned entries from getdents > } > close(dir); > > My question is whether the filesystem's readdir is expected to > consider the offset value in the second readdir to still be valid, > given that entries from the directory were deleted after the first > readdir. This would fail horribly on ntfs for example (once it supports file deletion anyway) since the offset value of all entries changes when the directory is modified. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:07 ` Anton Altaparmakov @ 2005-09-15 14:12 ` Anton Altaparmakov 2005-09-15 14:45 ` Miklos Szeredi ` (2 more replies) 2005-09-16 1:28 ` tridge 1 sibling, 3 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 14:12 UTC (permalink / raw) To: Akshat Aranya; +Cc: linux-fsdevel On Thu, 2005-09-15 at 15:07 +0100, Anton Altaparmakov wrote: > On Thu, 2005-09-15 at 09:57 -0400, Akshat Aranya wrote: > > I noticed that bonnie++, in its directory tests, does the following on > > a large directory > > > > open(dir); > > while (getdents() != 0) > > { > > unlink all the returned entries from getdents > > } > > close(dir); > > > > My question is whether the filesystem's readdir is expected to > > consider the offset value in the second readdir to still be valid, > > given that entries from the directory were deleted after the first > > readdir. > > This would fail horribly on ntfs for example (once it supports file > deletion anyway) since the offset value of all entries changes when the > directory is modified. Oops. I forgot to answer your question. Yes, the filesystem needs to consider the offset value in the second readdir to still be valid. You cannot keep rewinding back to zero every time you make a modification or you would keep returning entries you have already returned and never make any progress if e.g. some user does this in a loop at the same time: while 1; do touch blah rm blah done It would be a very effective DOS that any user can trigger. Bonnie++'s code is just complete crap... It is the author's fault that it will not work on filesystems where the directory entries are not in fixed locations... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:12 ` Anton Altaparmakov @ 2005-09-15 14:45 ` Miklos Szeredi 2005-09-15 15:17 ` Anton Altaparmakov 2005-09-15 15:51 ` Theodore Ts'o 2005-09-15 18:08 ` Nikita Danilov 2 siblings, 1 reply; 32+ messages in thread From: Miklos Szeredi @ 2005-09-15 14:45 UTC (permalink / raw) To: aia21; +Cc: aaranya, linux-fsdevel > Oops. I forgot to answer your question. Yes, the filesystem needs to > consider the offset value in the second readdir to still be valid. You > cannot keep rewinding back to zero every time you make a modification or > you would keep returning entries you have already returned and never > make any progress if e.g. some user does this in a loop at the same > time: > > while 1; do > touch blah > rm blah > done > > It would be a very effective DOS that any user can trigger. > > Bonnie++'s code is just complete crap... It is the author's fault that > it will not work on filesystems where the directory entries are not in > fixed locations... Why is it crap? 'rm -rf' does exactly the same. It's up to the filesystem to keep track of the file position in the directory, regardless of any deletions or additions. Miklos ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:45 ` Miklos Szeredi @ 2005-09-15 15:17 ` Anton Altaparmakov 2005-09-15 16:41 ` Jan Blunck 2005-09-15 16:51 ` Miklos Szeredi 0 siblings, 2 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 15:17 UTC (permalink / raw) To: Miklos Szeredi; +Cc: aaranya, linux-fsdevel On Thu, 2005-09-15 at 16:45 +0200, Miklos Szeredi wrote: > > Oops. I forgot to answer your question. Yes, the filesystem needs to > > consider the offset value in the second readdir to still be valid. You > > cannot keep rewinding back to zero every time you make a modification or > > you would keep returning entries you have already returned and never > > make any progress if e.g. some user does this in a loop at the same > > time: > > > > while 1; do > > touch blah > > rm blah > > done > > > > It would be a very effective DOS that any user can trigger. > > > > Bonnie++'s code is just complete crap... It is the author's fault that > > it will not work on filesystems where the directory entries are not in > > fixed locations... > > Why is it crap? 'rm -rf' does exactly the same. Then that is broken, too. But have you ever looked? I just did and you are partially wrong. It especially handles the case correctly by doing a "rewinddir()" on the directory before doing another readdir() after it has done an unlike. However you are right in that it only enables this code at ./configure time so on most Linux systems will not have it enabled in which case it will indeed be broken on filesystems like ntfs. > It's up to the filesystem to keep track of the file position in the > directory, regardless of any deletions or additions. That is completely untrue. It is up to the user. The file position in a directory is completely up to the caller and in particular when a directory is modified the file position in that directory is not changed by the filesystem. Doing this would in fact be impossible since the creation/deletion in the kernel has no access to all the "struct file"s with which a directory has been opened. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 15:17 ` Anton Altaparmakov @ 2005-09-15 16:41 ` Jan Blunck 2005-09-15 17:46 ` Jörn Engel 2005-09-15 20:28 ` Anton Altaparmakov 2005-09-15 16:51 ` Miklos Szeredi 1 sibling, 2 replies; 32+ messages in thread From: Jan Blunck @ 2005-09-15 16:41 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: miklos, aaranya, linux-fsdevel On Thu, Sep 15, Anton Altaparmakov wrote: > > That is completely untrue. It is up to the user. The file position in > a directory is completely up to the caller and in particular when a > directory is modified the file position in that directory is not changed > by the filesystem. Doing this would in fact be impossible since the > creation/deletion in the kernel has no access to all the "struct file"s > with which a directory has been opened. > No, thats not true. The file position in a regular (flat) file is completely up to the user. "Thanks" to the short-comings of POSIX on not defining the functions on directories, the file position in a directory is completely up to the file system. Due to that fact, modern file systems don't need to provide a "flat" view (in terms of sequential access to a directory file) on them. Although, the file system must guarantee, that readdir's must not fail because of changes directory contents. AFAIR ext2 is achieving this through not shrinking directories while the filesystem is online. Regards, Jan -- Jan Blunck jblunck@suse.de SuSE LINUX AG - A Novell company Maxfeldstr. 5 +49-911-74053-608 D-90409 Nürnberg http://www.suse.de - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:41 ` Jan Blunck @ 2005-09-15 17:46 ` Jörn Engel 2005-09-15 18:19 ` Theodore Ts'o 2005-09-15 21:00 ` Anton Altaparmakov 2005-09-15 20:28 ` Anton Altaparmakov 1 sibling, 2 replies; 32+ messages in thread From: Jörn Engel @ 2005-09-15 17:46 UTC (permalink / raw) To: Jan Blunck; +Cc: Anton Altaparmakov, miklos, aaranya, linux-fsdevel On Thu, 15 September 2005 18:41:10 +0200, Jan Blunck wrote: > > No, thats not true. The file position in a regular (flat) file is completely > up to the user. "Thanks" to the short-comings of POSIX on not defining the > functions on directories, the file position in a directory is completely up to > the file system. Due to that fact, modern file systems don't need to provide > a "flat" view (in terms of sequential access to a directory file) on > them. Although, the file system must guarantee, that readdir's must not fail > because of changes directory contents. AFAIR ext2 is achieving this through > not shrinking directories while the filesystem is online. Correct. But you're missing half of the problem. The other half is getting O(1) or O(ln(n)) performance for lookup on a large directory. Solution to this half alone would be to use a balanced tree for directory entries. Now, using a self-balancing tree would not solve your half of the problem, so it is not a full solution. Ext2, as far as I understand it, solved both halves by having both a hashtree and the sequential list. Getdents is walking the list and ignoring entries with the "deleted" flag. Lookup is using the hashtree, which points to the list for actual data. Jörn -- All art is but imitation of nature. -- Lucius Annaeus Seneca - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 17:46 ` Jörn Engel @ 2005-09-15 18:19 ` Theodore Ts'o 2005-09-15 21:04 ` Anton Altaparmakov ` (2 more replies) 2005-09-15 21:00 ` Anton Altaparmakov 1 sibling, 3 replies; 32+ messages in thread From: Theodore Ts'o @ 2005-09-15 18:19 UTC (permalink / raw) To: J?rn Engel; +Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > Correct. But you're missing half of the problem. > > The other half is getting O(1) or O(ln(n)) performance for lookup on > a large directory. Solution to this half alone would be to use a > balanced tree for directory entries. Now, using a self-balancing tree > would not solve your half of the problem, so it is not a full > solution. > > Ext2, as far as I understand it, solved both halves by having both a > hashtree and the sequential list. Getdents is walking the list and > ignoring entries with the "deleted" flag. Lookup is using the > hashtree, which points to the list for actual data. Actually, no. What we do is return the directory entries in hash sort order, by walking the btree. And we use the hash as the offset returned in d_off and via telldir(). The reason for this? So that if we add files which causes a node to split, that we still only return files in the directory once and only once. If we traversed the tree in sequential order only, then some files could be returned multiple times, and some files returned not at all after a node split operation. This is part of the "hair" that is needed in order to have a POSIX-compliant (and non-buggy) readdir/telldir implementation. The downside of returning files in hash sort order is that an application which tries to stat or open every single file in a directory in readdir() order will no longer do so in rough creation order, and this can cause the disk heads to seek all over the inode table to read the inode. This performance loss can fixed by sorting the files by inode number before opening them. This can be done in the application, or via a LD_PRELOAD library, which reads the entire directory at opendir() time, and then sorts the directory and returns it in sorted inode order. (Why can't we do this in the kernel? because for a very large directory, we would have to allocate a huge amount of non-swappable memory to store all of the filenames in the directory.) As I said, implementing POSIX-compliant opendir/readdir/telldir/seekdir is tricky for some filesystems, and filesystem engineers often curse POSIX telldir/seekdir. But they are a fact of life, and we have to deal with them. - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 18:19 ` Theodore Ts'o @ 2005-09-15 21:04 ` Anton Altaparmakov 2005-09-16 7:50 ` Nikita Danilov 2005-09-15 21:47 ` Jörn Engel 2005-09-16 7:29 ` Nikita Danilov 2 siblings, 1 reply; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 21:04 UTC (permalink / raw) To: Theodore Ts'o; +Cc: J?rn Engel, Jan Blunck, miklos, aaranya, linux-fsdevel On Thu, 15 Sep 2005, Theodore Ts'o wrote: > On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > > Correct. But you're missing half of the problem. > > > > The other half is getting O(1) or O(ln(n)) performance for lookup on > > a large directory. Solution to this half alone would be to use a > > balanced tree for directory entries. Now, using a self-balancing tree > > would not solve your half of the problem, so it is not a full > > solution. > > > > Ext2, as far as I understand it, solved both halves by having both a > > hashtree and the sequential list. Getdents is walking the list and > > ignoring entries with the "deleted" flag. Lookup is using the > > hashtree, which points to the list for actual data. > > Actually, no. What we do is return the directory entries in hash sort > order, by walking the btree. And we use the hash as the offset > returned in d_off and via telldir(). The reason for this? So that if > we add files which causes a node to split, that we still only return > files in the directory once and only once. If we traversed the tree > in sequential order only, then some files could be returned multiple > times, and some files returned not at all after a node split > operation. I don't know how your hash works but on ntfs this does not work like that. There is no parameter which would be able to tell you where you are in the b tree, unless you take a reference on an entry so it cannot be removed (rename==remove btw) but how does the reference get released later? > This is part of the "hair" that is needed in order to have a > POSIX-compliant (and non-buggy) readdir/telldir implementation. The > downside of returning files in hash sort order is that an application > which tries to stat or open every single file in a directory in > readdir() order will no longer do so in rough creation order, and this > can cause the disk heads to seek all over the inode table to read the > inode. This performance loss can fixed by sorting the files by inode > number before opening them. This can be done in the application, or > via a LD_PRELOAD library, which reads the entire directory at > opendir() time, and then sorts the directory and returns it in sorted > inode order. (Why can't we do this in the kernel? because for a very > large directory, we would have to allocate a huge amount of > non-swappable memory to store all of the filenames in the directory.) > > As I said, implementing POSIX-compliant > opendir/readdir/telldir/seekdir is tricky for some filesystems, and > filesystem engineers often curse POSIX telldir/seekdir. But they are > a fact of life, and we have to deal with them. Sure but given a large enough directory you will OOM the system rather than return the correct result which is why I keep saying the interface is broken... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 21:04 ` Anton Altaparmakov @ 2005-09-16 7:50 ` Nikita Danilov 0 siblings, 0 replies; 32+ messages in thread From: Nikita Danilov @ 2005-09-16 7:50 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: J?rn Engel, Jan Blunck, miklos, aaranya, linux-fsdevel Anton Altaparmakov writes: [...] > > I don't know how your hash works but on ntfs this does not work like that. > There is no parameter which would be able to tell you where you are in the > b tree, unless you take a reference on an entry so it cannot be removed > (rename==remove btw) but how does the reference get released later? Reiser4 has exactly the same problem, and this is how it was solved (not for the weak of heart): - offset of a directory entry is its ordinal number within a directory. I.e., first name in the directory has offset 0, second---1, etc.; - struct file for directory remembers "current readdir position" which is a pair (ordinal-number-of-entry, key-of-entry-in-the--tree); - on seekdir(dir, pos) (== lseek(dir, pos)), pos is the ordinal number of the target entry, and comparing it with the current readdir position we know how many entries to rewind to the left or to the right to reach the target. This can be done in any reasonable *-tree (for suitable values of *); - now the fun (or horror) part: above rewinding does not work correctly when entries were added or removed between current position and the target entry. To work about this, all directory modifications are monitored and current position is adjusted accordingly: when new directory entry is created, its key is compared against key in the current readdir position. If entry's key is smaller, then entry is added "before" current position and current position's ordinal-number-of-entry is incremented. Similarly for entry removal; - this of course is still fragile and can be broken, but that's enough for most cases (rm -r and friends work, for example); - additional complexities due to non-unique keys in the tree and NFS (where we don't have persistent struct file for a directory) are left as an exercise for a reader. Nikita. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 18:19 ` Theodore Ts'o 2005-09-15 21:04 ` Anton Altaparmakov @ 2005-09-15 21:47 ` Jörn Engel 2005-09-16 7:29 ` Nikita Danilov 2 siblings, 0 replies; 32+ messages in thread From: Jörn Engel @ 2005-09-15 21:47 UTC (permalink / raw) To: Theodore Ts'o Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel On Thu, 15 September 2005 14:19:47 -0400, Theodore Ts'o wrote: > On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > > Correct. But you're missing half of the problem. > > > > The other half is getting O(1) or O(ln(n)) performance for lookup on > > a large directory. Solution to this half alone would be to use a > > balanced tree for directory entries. Now, using a self-balancing tree > > would not solve your half of the problem, so it is not a full > > solution. > > > > Ext2, as far as I understand it, solved both halves by having both a > > hashtree and the sequential list. Getdents is walking the list and > > ignoring entries with the "deleted" flag. Lookup is using the > > hashtree, which points to the list for actual data. > > Actually, no. What we do is return the directory entries in hash sort > order, by walking the btree. And we use the hash as the offset > returned in d_off and via telldir(). The reason for this? So that if > we add files which causes a node to split, that we still only return > files in the directory once and only once. If we traversed the tree > in sequential order only, then some files could be returned multiple > times, and some files returned not at all after a node split > operation. You are right. I understood what ext2 was doing, saw a huge problem in lookup performance (O(n)) and was given a very brief explanation of hashtrees - which apparently did not match reality. Still wouldn't a hybrid design with both a list and a tree work? It appears to be just a different style of hair, not any prettier and just as sufficient to cover all bald spots. Jörn -- A quarrel is quickly settled when deserted by one party; there is no battle unless there be two. -- Seneca - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 18:19 ` Theodore Ts'o 2005-09-15 21:04 ` Anton Altaparmakov 2005-09-15 21:47 ` Jörn Engel @ 2005-09-16 7:29 ` Nikita Danilov 2005-09-16 11:58 ` Theodore Ts'o 2 siblings, 1 reply; 32+ messages in thread From: Nikita Danilov @ 2005-09-16 7:29 UTC (permalink / raw) To: Theodore Ts'o Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel Theodore Ts'o writes: > On Thu, Sep 15, 2005 at 07:46:58PM +0200, J?rn Engel wrote: > > Correct. But you're missing half of the problem. > > > > The other half is getting O(1) or O(ln(n)) performance for lookup on > > a large directory. Solution to this half alone would be to use a > > balanced tree for directory entries. Now, using a self-balancing tree > > would not solve your half of the problem, so it is not a full > > solution. > > > > Ext2, as far as I understand it, solved both halves by having both a > > hashtree and the sequential list. Getdents is walking the list and > > ignoring entries with the "deleted" flag. Lookup is using the > > hashtree, which points to the list for actual data. > > Actually, no. What we do is return the directory entries in hash sort > order, by walking the btree. And we use the hash as the offset > returned in d_off and via telldir(). The reason for this? So that if > we add files which causes a node to split, that we still only return > files in the directory once and only once. If we traversed the tree Except for hash collided directory entries, that are returned multiple times, because after seekdir() ext2_readdir() restarts from the start of hash-bucket, right? Nikita. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-16 7:29 ` Nikita Danilov @ 2005-09-16 11:58 ` Theodore Ts'o 0 siblings, 0 replies; 32+ messages in thread From: Theodore Ts'o @ 2005-09-16 11:58 UTC (permalink / raw) To: Nikita Danilov Cc: Jan Blunck, Anton Altaparmakov, miklos, aaranya, linux-fsdevel On Fri, Sep 16, 2005 at 11:29:29AM +0400, Nikita Danilov wrote: > > Actually, no. What we do is return the directory entries in hash sort > > order, by walking the btree. And we use the hash as the offset > > returned in d_off and via telldir(). The reason for this? So that if > > we add files which causes a node to split, that we still only return > > files in the directory once and only once. If we traversed the tree > > Except for hash collided directory entries, that are returned multiple > times, because after seekdir() ext2_readdir() restarts from the start of > hash-bucket, right? Yes, it's not perfect. But if you don't use telldir()/seekdir(), readdir() will return files in a directory once and only once. Actually, there's another exception, but that only happens when you have a hash collision, readdir() is in the middle of returning multiple directory entries all with the same hash, and at that moment the node containing the hash collisions gets split. This is quite rare, and pretty much impossible to trigger deliberately --- because the hash algorithm uses a filesystem specific secret to prevent attackers from deliberately creating files that could cause a hash collision, and potentially cause applications to misbehave. If you do use telldir/seekdir, and f_pos is pointing at a multiple directories with the same hash, then yes, you could get some repeats. But that's the best we can do given the horrific POSIX interface. You can solve the problem 100% perfectly if you adjust the filesystem format so there is a separate b-tree maintained solely for the purpose of keeping the telldir/seekdir indexes unique, and so you can traverse the tree in telldir index order. But of course the downside of that any operation that modifies the directory has extra overhead because there is an extra b-tree on disk that has to be kept up to date. If you have 64-bit telldir() cookies, and can count on being able to use 64-bit cookies for NFS, this also makes life easier (but still not 100% perfect); but there are a lot of 32-bit only systems still left in the world. - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 17:46 ` Jörn Engel 2005-09-15 18:19 ` Theodore Ts'o @ 2005-09-15 21:00 ` Anton Altaparmakov 2005-09-15 21:15 ` Charles P. Wright 1 sibling, 1 reply; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 21:00 UTC (permalink / raw) To: Jörn Engel; +Cc: Jan Blunck, miklos, aaranya, linux-fsdevel [-- Attachment #1: Type: TEXT/PLAIN, Size: 3104 bytes --] On Thu, 15 Sep 2005, [iso-8859-1] Jörn Engel wrote: > On Thu, 15 September 2005 18:41:10 +0200, Jan Blunck wrote: > > > > No, thats not true. The file position in a regular (flat) file is completely > > up to the user. "Thanks" to the short-comings of POSIX on not defining the > > functions on directories, the file position in a directory is completely up to > > the file system. Due to that fact, modern file systems don't need to provide > > a "flat" view (in terms of sequential access to a directory file) on > > them. Although, the file system must guarantee, that readdir's must not fail > > because of changes directory contents. AFAIR ext2 is achieving this through > > not shrinking directories while the filesystem is online. > > Correct. But you're missing half of the problem. > > The other half is getting O(1) or O(ln(n)) performance for lookup on > a large directory. Solution to this half alone would be to use a > balanced tree for directory entries. Now, using a self-balancing tree > would not solve your half of the problem, so it is not a full > solution. > > Ext2, as far as I understand it, solved both halves by having both a > hashtree and the sequential list. Getdents is walking the list and > ignoring entries with the "deleted" flag. Lookup is using the > hashtree, which points to the list for actual data. That is not too far from what ntfs does, too. It is a b tree (i.e. analogous to the ext2 hashtree) but on disk it is a flat file (analogous to the ext2 sequential list). The problem comes when you add or delete an entry the order of everything in the flat file is suddenly changed (compare it to taking the linked list apart and putting it back together in a different order). ntfs_readdir() instead of walking the b tree, simply reads the linear file and returns the entries in the order it finds them on disk. This is very efficient as it involves no seeking around, no tree walking, or anything like that. So there is no way to tell what the next entry should have been before the removal/addition as it may now be a completely different one. After all this cursing at the standards I think I may have found a way but it cuases readdir to be very inefficient. It might work to walk the b tree instead of just returning entries from the flat directory file data and to take a reference on the directory entry corresponding to f_pos so it cannot go away and then when readdir is called again resume from there in the b tree. The only problem again is when the reference to the directory entry will be released in the case that no further readdir and no close ever comes. It causes the fs to be unmountable for example and the file belonging to the directory entry to be undeletable... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 21:00 ` Anton Altaparmakov @ 2005-09-15 21:15 ` Charles P. Wright 2005-09-15 21:19 ` Anton Altaparmakov 0 siblings, 1 reply; 32+ messages in thread From: Charles P. Wright @ 2005-09-15 21:15 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: aaranya, linux-fsdevel On Thu, 2005-09-15 at 22:00 +0100, Anton Altaparmakov wrote: > The only problem again is when the reference to the > directory entry will be released in the case that no further readdir and > no close ever comes. It causes the fs to be unmountable for example > and the file belonging to the directory entry to be undeletable... Anton, If no further readdir "ever" comes, and the file is never closed, the file system will not be unmountable to begin with. It is just like opening and never closing a normal file, you keep the file system busy. I don't see the connection between the extra per-open-file state you need to save, and the argument that this opens up up for a DoS. Charles ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 21:15 ` Charles P. Wright @ 2005-09-15 21:19 ` Anton Altaparmakov 0 siblings, 0 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 21:19 UTC (permalink / raw) To: Charles P. Wright; +Cc: aaranya, linux-fsdevel On Thu, 15 Sep 2005, Charles P. Wright wrote: > On Thu, 2005-09-15 at 22:00 +0100, Anton Altaparmakov wrote: > > The only problem again is when the reference to the > > directory entry will be released in the case that no further readdir and > > no close ever comes. It causes the fs to be unmountable for example > > and the file belonging to the directory entry to be undeletable... > Anton, > > If no further readdir "ever" comes, and the file is never closed, the > file system will not be unmountable to begin with. It is just like > opening and never closing a normal file, you keep the file system busy. Er, yes, that is true. > I don't see the connection between the extra per-open-file state you > need to save, and the argument that this opens up up for a DoS. When I say DoS I mean "user triggerable OOM" on a large dirctory. Say you have a directory with 2^30 files in it and you need to snapshot it or something each time a user does opendir() + readir(). You would OOM very quickly... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:41 ` Jan Blunck 2005-09-15 17:46 ` Jörn Engel @ 2005-09-15 20:28 ` Anton Altaparmakov 1 sibling, 0 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 20:28 UTC (permalink / raw) To: Jan Blunck; +Cc: miklos, aaranya, linux-fsdevel On Thu, 15 Sep 2005, Jan Blunck wrote: > On Thu, Sep 15, Anton Altaparmakov wrote: > > That is completely untrue. It is up to the user. The file position in > > a directory is completely up to the caller and in particular when a > > directory is modified the file position in that directory is not changed > > by the filesystem. Doing this would in fact be impossible since the > > creation/deletion in the kernel has no access to all the "struct file"s > > with which a directory has been opened. > > No, thats not true. The file position in a regular (flat) file is completely > up to the user. "Thanks" to the short-comings of POSIX on not defining the > functions on directories, the file position in a directory is completely up to > the file system. Due to that fact, modern file systems don't need to provide > a "flat" view (in terms of sequential access to a directory file) on We are talking past each other and actually agreeing with each other. Our definitions of "up to the user" and "up to the fs" are at opposites. > them. Although, the file system must guarantee, that readdir's must not fail > because of changes directory contents. AFAIR ext2 is achieving this through > not shrinking directories while the filesystem is online. ntfs uses B trees on disk for directories. The tree must be balanced at all times and each insert/delete will be a balancing tree operation. This can completely change the tree shape and hence the on disk data. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 15:17 ` Anton Altaparmakov 2005-09-15 16:41 ` Jan Blunck @ 2005-09-15 16:51 ` Miklos Szeredi 2005-09-15 21:17 ` Anton Altaparmakov 1 sibling, 1 reply; 32+ messages in thread From: Miklos Szeredi @ 2005-09-15 16:51 UTC (permalink / raw) To: aia21; +Cc: miklos, aaranya, linux-fsdevel > Then that is broken, too. But have you ever looked? I looked at the strace yes. > I just did and you are partially wrong. It especially handles the > case correctly by doing a "rewinddir()" on the directory before > doing another readdir() after it has done an unlike. However you > are right in that it only enables this code at ./configure time so > on most Linux systems will not have it enabled in which case it will > indeed be broken on filesystems like ntfs. Well, having a configure option is not very consoling for the average luser faced with a non-working rm binary. > That is completely untrue. It is up to the user. I think the normal behavior for a filesystem is that for each entry which existed at the call to opendir, and continues to exist until the whole directory is read, will be returned exactly once. For any other case (entry added or removed during the directory reading) it is unspecified whether the entry is returned once or not returned. SUS doesn't say this in so many words, but there are hints: The type DIR, which is defined in the <dirent.h> header, represents a directory stream, which is an ordered sequence of all the directory entries in a particular directory. Directory entries represent files; files may be removed from a directory or added to a directory asynchronously to the operation of readdir(). [...] If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. > The file position in a directory is completely up to the caller and > in particular when a directory is modified the file position in that > directory is not changed by the filesystem. Doing this would in > fact be impossible since the creation/deletion in the kernel has no > access to all the "struct file"s with which a directory has been > opened. And does not need to. The only thing the filesystem must do is to guarantee, that IF the file position wasn't changed since the last getdents() call (i.e. no lseek(), or zero relative offset lseek() was done), the directory reading should continue from the place where it has been left off. Look at dcache_readdir() in fs/libfs.c for a way to deal with exactly this problem. Miklos ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:51 ` Miklos Szeredi @ 2005-09-15 21:17 ` Anton Altaparmakov 0 siblings, 0 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 21:17 UTC (permalink / raw) To: Miklos Szeredi; +Cc: aaranya, linux-fsdevel On Thu, 15 Sep 2005, Miklos Szeredi wrote: > > The file position in a directory is completely up to the caller and > > in particular when a directory is modified the file position in that > > directory is not changed by the filesystem. Doing this would in > > fact be impossible since the creation/deletion in the kernel has no > > access to all the "struct file"s with which a directory has been > > opened. > > And does not need to. The only thing the filesystem must do is to > guarantee, that IF the file position wasn't changed since the last > getdents() call (i.e. no lseek(), or zero relative offset lseek() was > done), the directory reading should continue from the place where it > has been left off. But how do you define "the place where it has been left off"? What you are saying is hand waving. No actual meaning behind those words. Example: f_pos = logical byte offset inside the linear on-disk directory stream. (This is what ntfs uses.) So if you do not change f_pos, the next readdir() will continue at the same logical byte offset inside the linear on-disk directory stream, i.e. it will continue at "the place where it has been left off". Which is what you specified! Notice the problem? There is no guarantee whatsoever that after a rm or create, this offset inside the on-disk directory stream points to the same directory entry any more since the entries will have been moved around. Even worse the very entry it was pointing to may have been removed altogether so its place will have been filled by a different entry. > Look at dcache_readdir() in fs/libfs.c for a way to deal with exactly > this problem. AFAICS this only works because it is a linked list. But I am rather tired now so I may well be missing something. That function is not very easy to understand... Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:12 ` Anton Altaparmakov 2005-09-15 14:45 ` Miklos Szeredi @ 2005-09-15 15:51 ` Theodore Ts'o 2005-09-15 16:52 ` Bryan Henderson 2005-09-15 20:25 ` Anton Altaparmakov 2005-09-15 18:08 ` Nikita Danilov 2 siblings, 2 replies; 32+ messages in thread From: Theodore Ts'o @ 2005-09-15 15:51 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote: > Oops. I forgot to answer your question. Yes, the filesystem needs to > consider the offset value in the second readdir to still be valid. You > cannot keep rewinding back to zero every time you make a modification or > you would keep returning entries you have already returned and never > make any progress if e.g. some user does this in a loop at the same > time: POSIX (or SUSv3) does not guarantee the offset data structure to be the dirent structure at all. So a portable application should not count of d_off on being present. That being said, it *is* fair game to assume that an application should be able to call readdir() repeatedly and get all files in the directory once and exactly once, even if another process is unlinking files or adding files while the readdir is going on. The only thing which is unspecified is whether a file which is deleted or added after the application has started iterating over the directory will be included or not. (Think about it; Unix is a multi-user, time-sharing system. Nothing else makes sense, since otherwise programs that used readdir() would randomly break if a directory is modified by another process at the same time.) In fact, POSIX requires that telldir() and seekdir() do the right thing even if directory entries are added or deleted between the telldir() and seekdir(). Yes, this is hard on directories which use something more sophisticated a simple linked list to store their directory entries (like a b-tree, for example). However, it is required by POSIX/SUSv3. The JFS filesystem, for example, uses an entirely separate b-tree just to guarantee telldir() and seekdir() indexes behave properly in the presence of file inserts and removals. > Bonnie++'s code is just complete crap... It is the author's fault that > it will not work on filesystems where the directory entries are not in > fixed locations... If Bonnie++ is relying on d_off, then yet. But in fact, if Bonniee++ is just doing a series of readdir()'s, and the filesystem doesn't do the right thing in the face of concurrent deletes or file creates, it is in fact the filesystem which is broken. It doesn't matter if the filesystem is using a sophisticated b-tree data structure; it still has to do the right thing. There is a lot of hair in ext3, jfs, xfs, reiserfs, etc. in order to guarantee this to be the case, since it is expected by Unix applications, and it is required by the standards specifications. (I often curse the POSIX specifiers for including telldir/seekdir into the standards, since it's hell to support, but it's there, and there are applications which rely on it --- unfortunately.) - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 15:51 ` Theodore Ts'o @ 2005-09-15 16:52 ` Bryan Henderson 2005-09-15 16:57 ` Jeremy Allison 2005-09-15 20:50 ` Anton Altaparmakov 2005-09-15 20:25 ` Anton Altaparmakov 1 sibling, 2 replies; 32+ messages in thread From: Bryan Henderson @ 2005-09-15 16:52 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Akshat Aranya, Anton Altaparmakov, linux-fsdevel Since others have opined differently, I'd like to add my voice to Ted's and say his explanation is exactly my understanding too; I couldn't have said it any better, right down to the part of cursing our reality. We hate to admit that this is the requirement of directory positions, because it's so hard to implement, but it's what is expected. When I explain the directory positioning requirement (which actually shows up in lots of places in software design besides filesystem directories), I always say the ideal is to think of the directory position as a white card within a deck of playing cards. When you say "tell me about the next card," you get the card directly on top of the white card and that card then moves under the white card. People can insert and delete cards all around the white card, but the sequence is still well defined. In actuality, it's acceptable to fall short of that ideal and ignore some or all insertions or deletions that happened after a pass started, and many implementations do. But what you'd have a really hard time getting accepted is returning the same entry twice or skipping an entry that was always there, as someone steps through the directory. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:52 ` Bryan Henderson @ 2005-09-15 16:57 ` Jeremy Allison 2005-09-15 20:51 ` Anton Altaparmakov 2005-09-15 20:50 ` Anton Altaparmakov 1 sibling, 1 reply; 32+ messages in thread From: Jeremy Allison @ 2005-09-15 16:57 UTC (permalink / raw) To: Bryan Henderson Cc: Theodore Ts'o, Akshat Aranya, Anton Altaparmakov, linux-fsdevel On Thu, Sep 15, 2005 at 09:52:33AM -0700, Bryan Henderson wrote: > Since others have opined differently, I'd like to add my voice to Ted's > and say his explanation is exactly my understanding too; I couldn't have > said it any better, right down to the part of cursing our reality. Well the reality is there are applications out there that create more than 1,000,000 files in a directory, and in order to support that efficiently having a working seekdir()/telldir() is a *MUST*. Jeremy. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:57 ` Jeremy Allison @ 2005-09-15 20:51 ` Anton Altaparmakov 0 siblings, 0 replies; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 20:51 UTC (permalink / raw) To: Jeremy Allison Cc: Bryan Henderson, Theodore Ts'o, Akshat Aranya, linux-fsdevel On Thu, 15 Sep 2005, Jeremy Allison wrote: > On Thu, Sep 15, 2005 at 09:52:33AM -0700, Bryan Henderson wrote: > > Since others have opined differently, I'd like to add my voice to Ted's > > and say his explanation is exactly my understanding too; I couldn't have > > said it any better, right down to the part of cursing our reality. > > Well the reality is there are applications out there that > create more than 1,000,000 files in a directory, and in > order to support that efficiently having a working > seekdir()/telldir() is a *MUST*. And exactly because of such large directories you cannot provide either a readdir or seekdir/telldir without some kind of locking on the dircetory you are working with. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 16:52 ` Bryan Henderson 2005-09-15 16:57 ` Jeremy Allison @ 2005-09-15 20:50 ` Anton Altaparmakov 2005-09-15 23:41 ` Bryan Henderson 1 sibling, 1 reply; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 20:50 UTC (permalink / raw) To: Bryan Henderson; +Cc: Theodore Ts'o, Akshat Aranya, linux-fsdevel On Thu, 15 Sep 2005, Bryan Henderson wrote: > Since others have opined differently, I'd like to add my voice to Ted's > and say his explanation is exactly my understanding too; I couldn't have > said it any better, right down to the part of cursing our reality. > > We hate to admit that this is the requirement of directory positions, > because it's so hard to implement, but it's what is expected. When I > explain the directory positioning requirement (which actually shows up in > lots of places in software design besides filesystem directories), I > always say the ideal is to think of the directory position as a white card > within a deck of playing cards. When you say "tell me about the next > card," you get the card directly on top of the white card and that card > then moves under the white card. People can insert and delete cards all > around the white card, but the sequence is still well defined. > > In actuality, it's acceptable to fall short of that ideal and ignore some > or all insertions or deletions that happened after a pass started, and > many implementations do. But what you'd have a really hard time getting > accepted is returning the same entry twice or skipping an entry that was > always there, as someone steps through the directory. That may be so but AFAICS it is impossible to implement without causing a user triggerable OOM/DOS when you have a filesystem where directory entries are static but are a B tree for example. The problem comes from the fact that there is no way to tell for how long that white card is valid to use your analogy. You would have to create one the first time you see a readdir() from a particular file descriptor on the directory. But when are you free to get rid of it? There is no way to tell. When the directory is closed I hear you say, but what if it never gets closed because the user is malicious? Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 20:50 ` Anton Altaparmakov @ 2005-09-15 23:41 ` Bryan Henderson 0 siblings, 0 replies; 32+ messages in thread From: Bryan Henderson @ 2005-09-15 23:41 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: Akshat Aranya, aia21, linux-fsdevel, Theodore Ts'o >The problem comes from >the fact that there is no way to tell for how long that white card is >valid to use your analogy. You would have to create one the first time >you see a readdir() from a particular file descriptor on the directory. >But when are you free to get rid of it? So it's analogous to the file pointer for a file - as long as you have a file open, you're occupying kernel memory because the kernel has to remember where you last read so it knows where to read next. That's why systems have resource limits on things like number of open files per process, number of processes per system, etc. They don't actually close the hole, but they're apparently good enough. It could still be an issue in Linux implementation, though. ISTR the filesystem drivers don't see the directory read sessions, which means they can't keep private stream state, so that the only information they can use to find a place in a directory sequence is that little position integer. People have found ways to make that enough information to fully describe the white card, but I know it isn't easy. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 15:51 ` Theodore Ts'o 2005-09-15 16:52 ` Bryan Henderson @ 2005-09-15 20:25 ` Anton Altaparmakov 2005-09-16 3:39 ` Theodore Ts'o 1 sibling, 1 reply; 32+ messages in thread From: Anton Altaparmakov @ 2005-09-15 20:25 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Akshat Aranya, linux-fsdevel On Thu, 15 Sep 2005, Theodore Ts'o wrote: > On Thu, Sep 15, 2005 at 03:12:38PM +0100, Anton Altaparmakov wrote: > > Oops. I forgot to answer your question. Yes, the filesystem needs to > > consider the offset value in the second readdir to still be valid. You > > cannot keep rewinding back to zero every time you make a modification or > > you would keep returning entries you have already returned and never > > make any progress if e.g. some user does this in a loop at the same > > time: > > POSIX (or SUSv3) does not guarantee the offset data structure to be > the dirent structure at all. So a portable application should not > count of d_off on being present. Why should f_pos be the dirent structure?!? That would be completely insane... > That being said, it *is* fair game to assume that an application > should be able to call readdir() repeatedly and get all files in the > directory once and exactly once, even if another process is unlinking > files or adding files while the readdir is going on. The only thing I disagree. readdir() is a completely brain damaged interface and it is not fair game to assume that at all... > which is unspecified is whether a file which is deleted or added after > the application has started iterating over the directory will be > included or not. (Think about it; Unix is a multi-user, time-sharing > system. Nothing else makes sense, since otherwise programs that used > readdir() would randomly break if a directory is modified by another > process at the same time.) Only if they are written badly! Also it depends what you mean by break. For example "while (i = readdir); do rm i; done" would not break, it would simply miss some files. It would not produce an error. If it were properly written when it is finished it would check if there are still things to delete and start again if so and keep looping until there are none left. Anything else _cannot_ work unless opendir() results in a read_lock on the directory and it is only unlocked on close. Nothing else is sane and will result in a trivial DOS by any user on the system where a fs has to play tricks. > In fact, POSIX requires that telldir() and seekdir() do the right > thing even if directory entries are added or deleted between the > telldir() and seekdir(). Yes, this is hard on directories which use Sorry what do you mean? They will and can work fine. You use telldir to give you and offset (f_pos) and seekdir puts the offset into f_pos. Nothing more nothing less. If you have removed files or added files in between two readdir calls (irrelevant whether you used seek/telldir) the f_pos will just now point in the wrong place and you will get some entries duplicated or you will miss some because you did not rewind back to 0 after the change and because the directory was not locked against modifications. > something more sophisticated a simple linked list to store their > directory entries (like a b-tree, for example). However, it is Yes, ntfs uses a B tree. > required by POSIX/SUSv3. The JFS filesystem, for example, uses an Er, have you read it? To quote from "IEEE Std 1003.1, 2004 Edition", seekdir, from the informative "rationale" section: <quote> The original standard developers perceived that there were restrictions on the use of the seekdir() and telldir() functions related to implementation details, and for that reason these functions need not be supported on all POSIX-conforming systems. They are required on implementations supporting the XSI extension. One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered. The definition of seekdir() and telldir() does not specify whether, when using these interfaces, a given directory entry will be seen at all, or more than once. On systems not supporting these functions, their capability can sometimes be accomplished by saving a filename found by readdir() and later using rewinddir() and a loop on readdir() to relocate the position from which the filename was saved. </quote> Thus any application relying on f_pos in a directory to be meaningful is broken by design and even POSIX says so. Heck seek/telldir is not even required in POSIX unless you implement he XSI extension (I admit I have no idea what XSI is)! (It says so above...) > entirely separate b-tree just to guarantee telldir() and seekdir() > indexes behave properly in the presence of file inserts and removals. So any user can cause DOS/OOM by doing a: "while 1; do opendir(); readdir(); done" on a really big directory (note how I am never closing the directory)... What a fantastic filesystem that is! All the sheep are jumping off the bridge, lets jump, too! I think not... > > Bonnie++'s code is just complete crap... It is the author's fault that > > it will not work on filesystems where the directory entries are not in > > fixed locations... > > If Bonnie++ is relying on d_off, then yet. But in fact, if Bonniee++ > is just doing a series of readdir()'s, and the filesystem doesn't do > the right thing in the face of concurrent deletes or file creates, it > is in fact the filesystem which is broken. It doesn't matter if the I still disagree. The standards are broken if they require that from readdir(). Obviously at least someone understands given the seekdir() description. > filesystem is using a sophisticated b-tree data structure; it still > has to do the right thing. There is a lot of hair in ext3, jfs, xfs, > reiserfs, etc. in order to guarantee this to be the case, since it is > expected by Unix applications, and it is required by the standards > specifications. > > (I often curse the POSIX specifiers for including telldir/seekdir into > the standards, since it's hell to support, but it's there, and there > are applications which rely on it --- unfortunately.) seekdir()/telldir() are no problem as they are meaningless and POSIX agrees. readdir() is the problem. It is _impossible_ to do what POSIX demands using readdir without some form of lock to say "directory cannot be modified". Or if not a lock then a snapshot. That is exactly what it is asking for! I guess that would be the only way to support it. Snapshot the directory and internally queue all modifications or apply them using COW. But the problem even then is hwo do you know when the user has finished calling readdir(). There is no guarantee they will keep going till EOD is reached. There is not even any guarantee the user will close the directory they opened. Again, this would be a DOS and cause OOM in no time on a huge directory. Maybe I am missing something... How would you suggest to work around the above described problems? Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 20:25 ` Anton Altaparmakov @ 2005-09-16 3:39 ` Theodore Ts'o 2005-09-16 11:57 ` Dave Kleikamp 0 siblings, 1 reply; 32+ messages in thread From: Theodore Ts'o @ 2005-09-16 3:39 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel On Thu, Sep 15, 2005 at 09:25:55PM +0100, Anton Altaparmakov wrote: > > POSIX (or SUSv3) does not guarantee the offset data structure to be > > the dirent structure at all. So a portable application should not > > count of d_off on being present. > > Why should f_pos be the dirent structure?!? That would be completely > insane... The only fields that are guaranted to be in the dirent structure by the POSIX specification are: ino_t d_ino; char d_name[]; No other fields must be provided by a POSIX-compliant implementation. Hence, a portable application must not use d_off. It may not be present on all POSIX-compliant operating systems. > > which is unspecified is whether a file which is deleted or added after > > the application has started iterating over the directory will be > > included or not. (Think about it; Unix is a multi-user, time-sharing > > system. Nothing else makes sense, since otherwise programs that used > > readdir() would randomly break if a directory is modified by another > > process at the same time.) > > Only if they are written badly! Also it depends what you mean by break. > For example "while (i = readdir); do rm i; done" would not break, it would > simply miss some files. It would not produce an error. It would mean that applications would behave unreliable if a directory is changing while it is calling readdir(). I would call that breaking. > If it were properly written when it is finished it would check if there > are still things to delete and start again if so and keep looping until > there are none left. It might not be deleting files; it might just be searching all of the files in a directory, so having some (potentially large) number of files not be searched just because a file happened to be added to a directory is BAD. It is a silent error which an application can not detect. > Anything else _cannot_ work unless opendir() results in a read_lock on the > directory and it is only unlocked on close. Nothing else is sane and will > result in a trivial DOS by any user on the system where a fs has to play > tricks. Not true. A clever fs can play tricks that will not result in a denial of service attack. > > In fact, POSIX requires that telldir() and seekdir() do the right > > thing even if directory entries are added or deleted between the > > telldir() and seekdir(). Yes, this is hard on directories which use > > Sorry what do you mean? They will and can work fine. You use telldir to > give you and offset (f_pos) and seekdir puts the offset into f_pos. telldir() and seekdir() do not necessarily have to return offsets. They simply have to return a cookie which can be used to return to that particular location. For example, ext3/htree uses the hash used the sort-key in the hashed b-tree as the cookie. You're right. The 1990 Posix specification does not guarantee that telldir()/seekdir() exists; unfortunately a sufficiently large number of applications (including Samba) assume that it exists, so if you don't implement it correctly, your filesystem will not be usable for those applications. > > [JFS uses an] > > entirely separate b-tree just to guarantee telldir() and seekdir() > > indexes behave properly in the presence of file inserts and removals. > > So any user can cause DOS/OOM by doing a: "while 1; do opendir(); > readdir(); done" on a really big directory (note how I am never closing > the directory)... What a fantastic filesystem that is! All the sheep are > jumping off the bridge, lets jump, too! I think not... First of all, the separate b-tree is maintained on disk, as a permanent part of the filesystem metadata. When a directory entry is added to a directory, it creates a unique seekdir index which is added to the seekdir/telldir b-tree. This b-tree is used for nothing else, and readdir() returns directory entries in the seekdir-index order, by walking the seekdir-index b-tree. If a program does "while (1); do opendir(); readdir(); done", the opendir will eventually return an error when you consume all available file descriptors. It is no different from "while (1); do open(); done". > readdir() is the problem. It is _impossible_ to do what POSIX demands > using readdir without some form of lock to say "directory cannot be > modified". Or if not a lock then a snapshot. That is exactly what it is > asking for! I guess that would be the only way to support it. Snapshot > the directory and internally queue all modifications or apply them using > COW. But the problem even then is hwo do you know when the user has > finished calling readdir(). There is no guarantee they will keep going > till EOD is reached. There is not even any guarantee the user will close > the directory they opened. Again, this would be a DOS and cause OOM in no > time on a huge directory. Not so old Chinese saying... "Man who says something is impossible should not interrupt man doing it". Ext3 with hashed-trees does not use a lock to prevent directory modifications, and yet can guarantee that each file in the directory is returned once and only once, even if nodes in the tree need to be split. Yes, it requires some cleverness in the implementation, but it _can_ be done. JFS also uses a b-tree (or more than one b-tree) to index its directory, and also gives the same guarantee, but it implements it in an entirely different way. Two different filesystems; two different strategies; both implement this guarantee without needing to lock the directory against modifications during the readdir() scan. Take a look at the implementations closely. (Hint: both involve walking the b-tree and returning readdir() entries in tree order. The details though of which b-tree, how the b-tree is indexed, how the index is stored in file descriptor offset, and how telldir/seekdir are implemented are different in the two filesystems, however.) - Ted ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-16 3:39 ` Theodore Ts'o @ 2005-09-16 11:57 ` Dave Kleikamp 0 siblings, 0 replies; 32+ messages in thread From: Dave Kleikamp @ 2005-09-16 11:57 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Anton Altaparmakov, Akshat Aranya, linux-fsdevel On Thu, 2005-09-15 at 23:39 -0400, Theodore Ts'o wrote: > JFS also uses a b-tree (or more than one b-tree) to index its > directory, and also gives the same guarantee, but it implements it in > an entirely different way. Here are the detail of jfs's implementation. Each directory entry is assigned sequential "index" when it is created, which is stored in the directory entry in the dtree (the primary directory data structure). In addition, an entry is created in the "index table", which is an array keyed on the index. Initially, this table can be stored directly in the inode, but as it grows, the table is stored in other blocks, and is addressed by a b-tree anchored in the inode. For an existing entry, the table stores the address of the dtree page, and the offset within the page. This table is updated whenever a change to the directory moves the entry. When an entry is deleted, the table contains the index of the next existing entry in dtree-order. Resuming a search means getting the table element for the index, checking a flag whether it exists or has been deleted, following the chain of "next" indices until an existing one is found, and getting the dtree page address and offset of the entry. The disadvantages are the need to update the directory table as entries are moved, and the fact that the index table does not shrink as entries are removed. (The table is completely truncated if all entries are removed from a directory, but that isn't typically going to happen.) > Two different filesystems; two different strategies; both implement > this guarantee without needing to lock the directory against > modifications during the readdir() scan. Take a look at the > implementations closely. > > (Hint: both involve walking the b-tree and returning readdir() entries > in tree order. The details though of which b-tree, how the b-tree is > indexed, how the index is stored in file descriptor offset, and how > telldir/seekdir are implemented are different in the two filesystems, > however.) -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:12 ` Anton Altaparmakov 2005-09-15 14:45 ` Miklos Szeredi 2005-09-15 15:51 ` Theodore Ts'o @ 2005-09-15 18:08 ` Nikita Danilov 2005-09-16 11:23 ` Miklos Szeredi 2 siblings, 1 reply; 32+ messages in thread From: Nikita Danilov @ 2005-09-15 18:08 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: linux-fsdevel Anton Altaparmakov writes: [...] > > Bonnie++'s code is just complete crap... It is the author's fault that > it will not work on filesystems where the directory entries are not in > fixed locations... Unfortunately Single UNIX Specification states that "offset" (telldir(3)'s result) is only invalidated by calls to rewinddir(3). It is true that for certain (heh, almost _all_ non-s5fs-descending) file systems it is hard to maintain directory offsets stable in the face on concurrent directory modifications, but that's what current standards require and what user level software relies upon. > > Best regards, > > Anton Nikita. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 18:08 ` Nikita Danilov @ 2005-09-16 11:23 ` Miklos Szeredi 0 siblings, 0 replies; 32+ messages in thread From: Miklos Szeredi @ 2005-09-16 11:23 UTC (permalink / raw) To: nikita; +Cc: aia21, linux-fsdevel > > > > Bonnie++'s code is just complete crap... It is the author's fault that > > it will not work on filesystems where the directory entries are not in > > fixed locations... > > Unfortunately Single UNIX Specification states that "offset" > (telldir(3)'s result) is only invalidated by calls to rewinddir(3). And the fact the rewinddir() invalidates the offsets returned by telldir() is unfortunately not very useful, since the current kernel interface doesn't distingish seekdir() to to beginning of the stream (which _shoud not_ invalidate offsets) and rewiddir() (which _may_ invalidate offsets). IOW offsets returned by telldir() will have to be valid even after rewiddir(). Miklos ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Expected getdents behaviour 2005-09-15 14:07 ` Anton Altaparmakov 2005-09-15 14:12 ` Anton Altaparmakov @ 2005-09-16 1:28 ` tridge 1 sibling, 0 replies; 32+ messages in thread From: tridge @ 2005-09-16 1:28 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: Akshat Aranya, linux-fsdevel At the risk of thowing a spanner in the works, Samba relies pretty heavily on seekdir()/telldir()/unlink() interacting in "nice" ways. The most problematic case is OS/2 clients, which use a bizarre pattern for removing all files in a directory. I've coded up the equivalent set of posix directory operations for a OS/2 client deleting a directory in: http://samba.org/ftp/unpacked/junkcode/os2_delete.c what they do is: 1) open directory 2) read 100 entries, asking for the 'telldir' location of each 3) delete the first 4 of those entries 4) seekdir to the offset returned by the 4th entry (now deleted) 5) read another 100 entries 6) go to step 3 this continues until the read of the directory returns no entries. At this point the client assumes the directory is empty. This works fine on Linux. On FreeBSD systems the way libc telldir() works means that the above pattern uses a huge amount of memory, and ends up not deleting all entries in the directory, so we ended up replacing telldir() on FreeBSD inside Samba4. See: http://samba.org/ftp/unpacked/samba4/source/lib/replace/repdir/ So if this discussion leads to any changes in getdents(), please don't use the FreeBSD method (it allocates a linked list of all telldir() results, and doesn't correctly cope with deletions during a directory scan). Cheers, Tridge ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2005-09-16 11:58 UTC | newest] Thread overview: 32+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-09-15 13:57 Expected getdents behaviour Akshat Aranya 2005-09-15 14:03 ` Peter Staubach 2005-09-15 14:07 ` Anton Altaparmakov 2005-09-15 14:12 ` Anton Altaparmakov 2005-09-15 14:45 ` Miklos Szeredi 2005-09-15 15:17 ` Anton Altaparmakov 2005-09-15 16:41 ` Jan Blunck 2005-09-15 17:46 ` Jörn Engel 2005-09-15 18:19 ` Theodore Ts'o 2005-09-15 21:04 ` Anton Altaparmakov 2005-09-16 7:50 ` Nikita Danilov 2005-09-15 21:47 ` Jörn Engel 2005-09-16 7:29 ` Nikita Danilov 2005-09-16 11:58 ` Theodore Ts'o 2005-09-15 21:00 ` Anton Altaparmakov 2005-09-15 21:15 ` Charles P. Wright 2005-09-15 21:19 ` Anton Altaparmakov 2005-09-15 20:28 ` Anton Altaparmakov 2005-09-15 16:51 ` Miklos Szeredi 2005-09-15 21:17 ` Anton Altaparmakov 2005-09-15 15:51 ` Theodore Ts'o 2005-09-15 16:52 ` Bryan Henderson 2005-09-15 16:57 ` Jeremy Allison 2005-09-15 20:51 ` Anton Altaparmakov 2005-09-15 20:50 ` Anton Altaparmakov 2005-09-15 23:41 ` Bryan Henderson 2005-09-15 20:25 ` Anton Altaparmakov 2005-09-16 3:39 ` Theodore Ts'o 2005-09-16 11:57 ` Dave Kleikamp 2005-09-15 18:08 ` Nikita Danilov 2005-09-16 11:23 ` Miklos Szeredi 2005-09-16 1:28 ` tridge
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).