* Large directories and poor order correlation (thread: 19+ messages)
From: Phillip Susi @ 2011-03-14 20:24 UTC
To: linux-ext4@vger.kernel.org

Shouldn't copying, extracting, or otherwise populating a large directory
of many small files all at once result in a strong correlation between
the order the names appear in the directory and the order their data
blocks are stored on disk, so that read performance is not negatively
impacted by fragmentation?

Background:

While migrating a server to a new system, I noticed that it was taking
forever to rsync my Maildir.  It seemed to be due to very low disk
throughput caused by seeking.  I confirmed this with timing tests, using
both tar and dump to /dev/zero to read the files after dropping caches.
I noticed that tar was horribly slow, and dump was much better.  I
surmise that this was due to poor correlation between the order of file
names in the directory and their data blocks on disk.  I would expect
this on an old fs that has grown slowly over a few years, and that it
would mostly go away after copying the files to a new system.  I found
some surprises, the big one being that after copying the files to the
new system, they still have a poor correlation between directory and
inode order.

Details:

The old system was a single disk with sequential throughput of 51 MB/s,
and the new one is a 4-disk RAID-5 with sequential throughput of
160 MB/s.

On the old system, tar took 30 minutes and dump took 8 minutes.  On the
new system, tar took 18 minutes and dump took a mere 30 seconds!

On just the linux-kernel Maildir, which has 85,364 files taking up
660 MB of space, dump on the old system clocks in at 11m41s and only 10s
on the new system.
I wrote a Python script to actually measure the correlation between name
and inode order, inode and data block order, and name and data block
order.  It enumerates the files and counts each one as either in or out
of order by comparing its inode (or block) number to the previous one.
I expected to see a much better correlation on the new system, but I did
not.

On the new system the linux-kernel Maildir gets these results:

Name to inode correlation: 0.50002342908
Name to block correlation: 0.49996485638
Inode to block correlation: 0.889239023476

And on the old system:

Name to inode correlation: 0.499531418397
Name to block correlation: 0.499554847477
Inode to block correlation: 0.987418583946

The other folders get similar results.  You can see that the
inode-to-block correlation improved, but it wasn't very bad to begin
with, so going from 8 minutes to 30 seconds seems to be a good deal more
improvement than this would explain.  What concerns me, though, is that
the name-to-inode correlation went from terrible to slightly worse,
which is why tar is still horribly slow.

Attaching the script for reference.
[-- Attachment #2: correlation.py --]

#!/usr/bin/python
# Measures how well name order, inode order, and data block order
# correlate in the current directory.  1.0 = perfectly ascending,
# 0.5 = essentially random.
import array
import fcntl
import os
from stat import S_ISREG

FIBMAP = 1  # ioctl from <linux/fs.h>; usually requires root

names = os.listdir('.')

lastino = 0
name_to_ino_in = 0
name_to_ino_out = 0
lastblock = 0
name_to_block_in = 0
name_to_block_out = 0
iblocks = list()
inode_to_block_in = 0
inode_to_block_out = 0

for name in names:
    try:
        st = os.stat(name)
    except OSError:
        continue
    if not S_ISREG(st.st_mode):
        continue
    if st.st_ino > lastino:
        name_to_ino_in += 1
    else:
        name_to_ino_out += 1
    lastino = st.st_ino
    f = open(name)
    buf = array.array('I', [0])  # logical block 0 in, physical block out
    err = fcntl.ioctl(f.fileno(), FIBMAP, buf)
    f.close()
    if err != 0:
        print "ioctl failed on " + name
    block = buf[0]
    if block != 0:
        if block > lastblock:
            name_to_block_in += 1
        else:
            name_to_block_out += 1
        lastblock = block
        iblocks.append((st.st_ino, block))

print "Name to inode correlation: " + \
    str(float(name_to_ino_in) / (name_to_ino_in + name_to_ino_out))
print "Name to block correlation: " + \
    str(float(name_to_block_in) / (name_to_block_in + name_to_block_out))

iblocks.sort()
lastblock = 0
for ino, block in iblocks:
    if block > lastblock:
        inode_to_block_in += 1
    else:
        inode_to_block_out += 1
    lastblock = block

print "Inode to block correlation: " + \
    str(float(inode_to_block_in) / (inode_to_block_in + inode_to_block_out))
* Re: Large directories and poor order correlation
From: Eric Sandeen @ 2011-03-14 20:37
To: Phillip Susi
Cc: linux-ext4@vger.kernel.org

On 3/14/11 3:24 PM, Phillip Susi wrote:
> Shouldn't copying or extracting or otherwise populating a large
> directory of many small files at the same time result in a strong
> correlation between the order the names appear in the directory, and
> the order their data blocks are stored on disk, and thus, read
> performance should not be negatively impacted by fragmentation?

No, because htree (dir_index) dirs return names in hash-value order, not
inode number order -- i.e. "at random."

As you say, sorting by inode number will work much better...

-Eric

[...]
* Re: Large directories and poor order correlation
From: Phillip Susi @ 2011-03-14 20:52
To: Eric Sandeen
Cc: linux-ext4@vger.kernel.org

On 3/14/2011 4:37 PM, Eric Sandeen wrote:
> No, because htree (dir_index) dirs returns names in hash-value
> order, not inode number order. i.e. "at random."

I thought that the htree was used to look up names, but the normal
directory was used to enumerate them?  In other words, the htree speeds
up opening a single file but slows down traversing the entire directory,
so it should not be used there.

Also, isn't htree only enabled for large directories?  I still see
crummy correlation for small directories (< 100 files, even one with
only 8 entries).

It seems unreasonable to ask applications to read all directory entries,
then sort them by inode number, to achieve reasonable performance.  This
seems like something the fs should be doing.
* Re: Large directories and poor order correlation
From: Eric Sandeen @ 2011-03-14 21:12
To: Phillip Susi
Cc: linux-ext4@vger.kernel.org

On 3/14/11 3:52 PM, Phillip Susi wrote:
> I thought that the htree was used to look up names, but the normal
> directory was used to enumerate them?  In other words, the htree speeds
> up opening a single file but slows down traversing the entire
> directory, so it should not be used there.

readdir uses htree / dir_index:

ext3_readdir():

	if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
				    EXT3_FEATURE_COMPAT_DIR_INDEX) &&
	    ((EXT3_I(inode)->i_flags & EXT3_INDEX_FL) ||
	     ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
		err = ext3_dx_readdir(filp, dirent, filldir);

Because dir_index places entries into blocks in hash order, reading it
"like a non-dir_index dir" still gives you out-of-order entries, I
think.  IOW, it doesn't slow down readdir, it just gives you a nasty
order -- slowing down access to those files.

> Also, isn't htree only enabled for large directories?  I still see
> crummy correlation for small directories (< 100 files, even one with
> only 8 entries).

Nope, it's used for all directories AFAIK.  Certainly shows the most
improvement on lookups in large directories, though...

> It seems unreasonable to ask applications to read all directory
> entries, then sort them by inode number, to achieve reasonable
> performance.  This seems like something the fs should be doing.

Yeah, this has been a longstanding nastiness...

-Eric
* Re: Large directories and poor order correlation
From: Ted Ts'o @ 2011-03-14 21:52
To: Phillip Susi
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On Mon, Mar 14, 2011 at 04:52:21PM -0400, Phillip Susi wrote:
> It seems unreasonable to ask applications to read all directory
> entries, then sort them by inode number, to achieve reasonable
> performance.  This seems like something the fs should be doing.

Unfortunately the kernel can't do it, because a directory could be
arbitrarily big, and kernel memory is non-swappable.  In addition, what
if a process opens a directory, starts calling readdir(), pauses in the
middle, and then holds onto it for days, weeks, or months?  Combine that
with the POSIX requirements about how readdir() has to behave if files
are added or deleted during a readdir() session (even a readdir session
which takes days, weeks, or months), and it's a complete mess.

It's not hard to provide library routines that do the right thing, and I
have written an LD_PRELOAD which intercepts opendir() and readdir()
calls and does the sorting in userspace.  Perhaps the right answer is
getting this into libc, but I have exactly two words for you: "Ulrich
Drepper".

						- Ted
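[Editor's note: the userspace-sorting approach Ted describes can be
sketched in a few lines.  This is modern Python 3 rather than an
LD_PRELOAD shim, purely as an illustration of the idea; the function
name is invented and nothing here is his actual code.]

```python
import os

def readdir_sorted_by_inode(path='.'):
    # The d_ino field the kernel returns for each dirent is enough to
    # sort by; no stat() calls are needed for the sort itself.
    entries = [(e.inode(), e.name) for e in os.scandir(path)]
    entries.sort()
    return [name for _, name in entries]

# stat()ing the files in inode order reads the inode table mostly
# sequentially instead of seeking at random:
for name in readdir_sorted_by_inode('.'):
    try:
        os.lstat(name)
    except FileNotFoundError:
        pass  # file deleted between readdir and stat
```

An LD_PRELOAD shim does essentially the same thing: read the whole
directory at opendir() time, sort, and replay entries from readdir().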
* Re: Large directories and poor order correlation
From: Phillip Susi @ 2011-03-14 23:43
To: Ted Ts'o
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On 03/14/2011 05:52 PM, Ted Ts'o wrote:
> Unfortunately the kernel can't do it, because a directory could be
> arbitrarily big, and kernel memory is non-swappable.

Buffers/cache is discardable, though.  Or does the entire htree have to
be kept in slab or something?

> In addition, what if a process opens a directory, starts calling
> readdir(), pauses in the middle, and then holds onto it for days,
> weeks, or months?

The same thing that happened before htree?

> It's not hard to provide library routines that do the right thing, and
> I have written an LD_PRELOAD which intercepts opendir() and readdir()
> calls and does the sorting in userspace.

Wouldn't it be better to just have readdir() use the main directory,
which presumably is in a more sane ordering, instead of the htree?  That
way you don't have to burn CPU and RAM sorting on every opendir().

Also, I have checked some smaller directories, and lsattr reports they
are NOT using indexing, yet they still display poor correlation.
* Re: Large directories and poor order correlation
From: Ted Ts'o @ 2011-03-15 0:14
To: Phillip Susi
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On Mon, Mar 14, 2011 at 07:43:57PM -0400, Phillip Susi wrote:
> > what if a process opens a directory, starts calling readdir, pauses
> > in the middle, and then holds onto it for days, weeks, or months?
>
> The same thing that happened before htree?

No, because the readdir() semantics are exactly set up to mirror a
linear directory tree.

> Wouldn't it be better to just have readdir() use the main directory,
> which presumably is in a more sane ordering, instead of the htree?
> That way you don't have to burn CPU and RAM sorting on every
> opendir().

The htree is embedded into the directory blocks (the interior nodes
appear to ext2 and older kernels as deleted directory entries).  The
leaf blocks are in the "main directory" as well; there is no separate
htree.

The reason why we have to traverse the directory in htree order is the
POSIX requirements on how readdir() works in the face of file deletes
and creations, and what needs to happen if a leaf block needs to be
split.  Even if the readdir() started three months ago, if in the
intervening time leaf nodes have been split, readdir() is not allowed to
return the same file twice.

> Also, I have checked some smaller directories, and lsattr reports they
> are NOT using indexing, yet they still display poor correlation.

Well, if the file system has been around for a long time, and there are
lots of "holes" in the inode allocation bitmap, that can happen even
without indexing.

As another example, if you have a large maildir directory w/o indexing,
and files get added and deleted over time, the order of the directory
entries will have very little to do with the inode numbers.  That's why
programs like mutt sort the directory entries by inode number.

						- Ted
* Re: Large directories and poor order correlation
From: Phillip Susi @ 2011-03-15 14:01
To: Ted Ts'o
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On 3/14/2011 8:14 PM, Ted Ts'o wrote:
> The reason why we have to traverse the directory in htree order is the
> POSIX requirements on how readdir() works in the face of file deletes
> and creations, and what needs to happen if a leaf block needs to be
> split.  Even if the readdir() started three months ago, if in the
> intervening time leaf nodes have been split, readdir() is not allowed
> to return the same file twice.

This would also be fixed by having readdir() traverse the linear
directory entries rather than the htree.

> Well, if the file system has been around for a long time, and there
> are lots of "holes" in the inode allocation bitmap, that can happen
> even without indexing.

Why is that?  Sure, if the inode table is full of small holes I can see
them not being allocated sequentially, but why don't they at least tend
to be allocated in ascending order?

> As another example, if you have a large maildir directory w/o
> indexing, and files get added and deleted over time, the order of the
> directory entries will have very little to do with the inode numbers.
> That's why programs like mutt sort the directory entries by inode
> number.

Is this what e2fsck -D fixes?  Does it rewrite the directory entries in
inode order?  I've been toying with the idea of adding directory
optimization support to e2defrag.

To try and clarify this point a bit: are you saying that applications
like tar and rsync should be patched to sort the directory by inode
number, rather than it being the job of the fs to return entries in a
good order?
* Re: Large directories and poor order correlation
From: Rogier Wolff @ 2011-03-15 14:33
To: Phillip Susi
Cc: Ted Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org

On Tue, Mar 15, 2011 at 10:01:24AM -0400, Phillip Susi wrote:
> To try and clarify this point a bit: are you saying that applications
> like tar and rsync should be patched to sort the directory by inode
> number, rather than it being the job of the fs to return entries in a
> good order?

IMHO, it is the job of the FS to perform well in common cases.

	Roger.

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous
statement.  Does it sit on the couch all day?  Is it unemployed?  Please
be specific!  Define 'it' and what it isn't doing.
--------- Adapted from lxrbot FAQ
* Re: Large directories and poor order correlation
From: Ric Wheeler @ 2011-03-15 14:36
To: Rogier Wolff
Cc: Phillip Susi, Ted Ts'o, Eric Sandeen, linux-ext4@vger.kernel.org

On 03/15/2011 10:33 AM, Rogier Wolff wrote:
> IMHO, it is the job of the FS to perform well in common cases.

Some file systems do this well.  The thread here is more focused on
dealing with issues for file systems that handle this poorly :)

Ric
* Re: Large directories and poor order correlation
From: Ted Ts'o @ 2011-03-15 17:08
To: Phillip Susi
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On Tue, Mar 15, 2011 at 10:01:24AM -0400, Phillip Susi wrote:
> This would also be fixed by having readdir() traverse the linear
> directory entries rather than the htree.

No, because the directory blocks are the leaf nodes, and in the case of
a node split, we need to copy half of the directory entries in one block
and move them to a newly allocated block.  If readdir() were traversing
the linear directory entries, and had already traversed the directory
block that needs to split, then it would return the directory entries
that got copied into a new leaf (i.e., new directory block) a second
time.

> Why is that?  Sure, if the inode table is full of small holes I can
> see them not being allocated sequentially, but why don't they at least
> tend to be allocated in ascending order?

Unless some files get deleted in between.  Now depending on the "holes"
in the directory blocks, the new directory entry, even in the non-htree
case, either goes wherever an empty directory entry can be found, or, in
the worst case, we might need to allocate a new block, and the new
directory entry gets added to the end of the directory.

I suggest that you try some experiments, using both dir_index and
non-dir_index file systems, and then look at the directories using the
debugfs "ls" and "htree_dump" commands.  Either believe me, or learn how
things really work.  :-)

> Is this what e2fsck -D fixes?  Does it rewrite the directory entries
> in inode order?  I've been toying with the idea of adding directory
> optimization support to e2defrag.

Yes, e2fsck -D will optimize directory entries as best it can.  In the
non-dir_index case, it will sort the directory entries so they are in
inode order and squeeze out "holes" in the directory blocks, thus
compacting the directory.  In the dir_index case, it will optimize the
hash tree of the directory as much as possible.

> To try and clarify this point a bit: are you saying that applications
> like tar and rsync should be patched to sort the directory by inode
> number, rather than it being the job of the fs to return entries in a
> good order?

That's correct.  I wish the file system could do this for applications,
but there are some corner cases involving very large directories, and
programs that use readdir() but then stall for days, months, or years,
that make this very hard.

I suppose we could allocate up to some tunable amount of directory
space, say 64k or 128k, and do the sorting inside the kernel.  We would
then have to live with the fact that each badly behaved program which
calls opendir(), then a single readdir(), and then stops, will consume
128k of non-swappable kernel memory until the process gets killed.  A
process which does this thousands of times could potentially carry out a
resource exhaustion attack on the system.  We could try to patch over
that by, say, creating a new resource limit on the number of directories
a process can keep open at a time, but then the attacker could just fork
some additional child processes....

If someone wants to try to solve this problem, patches that are clean
and which don't open up the kernel to a nasty DOS attack are welcome.

						- Ted
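[Editor's note: the leaf-split hazard Ted describes can be shown with a
toy model.  This is plain Python with an invented function name and
invented entry names; no filesystem is involved.  A linear reader that
has already consumed a directory block sees half of its entries again
after that block splits.]

```python
def linear_readdir_during_split(blocks, consumed):
    # 'blocks' is a list of directory blocks, each a list of names.
    # The reader has already consumed 'consumed' leading blocks when
    # block 0 splits: half its entries are copied into a new block
    # appended at the end of the directory.
    seen = []
    for b in blocks[:consumed]:
        seen.extend(b)
    half = len(blocks[0]) // 2          # the split happens here
    blocks.append(blocks[0][half:])
    blocks[0] = blocks[0][:half]
    for b in blocks[consumed:]:         # reader resumes its linear walk
        seen.extend(b)
    return seen

# The reader returns 'c' and 'd' twice, which POSIX forbids:
print(linear_readdir_during_split([['a', 'b', 'c', 'd']], consumed=1))
# ['a', 'b', 'c', 'd', 'c', 'd']
```

If the reader had not yet reached the splitting block (consumed=0), no
duplicates appear, which is why the hazard only bites long-running
readdir() sessions.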
* Re: Large directories and poor order correlation
From: Phillip Susi @ 2011-03-15 19:08
To: Ted Ts'o
Cc: Eric Sandeen, linux-ext4@vger.kernel.org

On 3/15/2011 1:08 PM, Ted Ts'o wrote:
> No, because the directory blocks are the leaf nodes, and in the case
> of a node split, we need to copy half of the directory entries in one
> block and move them to a newly allocated block.

When you split the htree node, aren't you just moving around the
"deleted entries"?  So the normal names remain in the same order, and
readdir() doesn't have a problem when it ignores the htree entries and
just walks the normal names?

Also, how do you deal with this when you do end up rebalancing the htree
during a readdir()?  I would think that keeping that straight would be
much more difficult than handling the problem with linear directory
entries.  Why was the htree hidden inside the normal directory structure
anyway?

> Unless some files get deleted in between.  Now depending on the
> "holes" in the directory blocks, the new directory entry, even in the
> non-htree case, either goes wherever an empty directory entry can be
> found, or, in the worst case, we might need to allocate a new block,
> and the new directory entry gets added to the end of the directory.

Right, but on an otherwise idle system, when you make all the files at
once via rsync or untarring an archive, this shouldn't happen and they
should (generally) be in ascending order, shouldn't they?

> I suggest that you try some experiments, using both dir_index and
> non-dir_index file systems, and then look at the directories using the
> debugfs "ls" and "htree_dump" commands.  Either believe me, or learn
> how things really work.  :-)

Now THAT sounds interesting.  Is this documented somewhere?  Also, why
can't chattr set/clear the 'I' flag?  Is it just too cumbersome to do at
runtime?  So setting or clearing the flag with debugfs followed by a
fsck should do the trick?  And when is it automatically enabled?

> I suppose we could allocate up to some tunable amount of directory
> space, say 64k or 128k, and do the sorting inside the kernel.  We
> would then have to live with the fact that each badly behaved program
> which calls opendir(), then a single readdir(), and then stops, will
> consume 128k of non-swappable kernel memory until the process gets
> killed.  A process which does this thousands of times could
> potentially carry out a resource exhaustion attack on the system.

I think you are right that if sorting is to be done at
opendir()/readdir() time, it should be done in libc, not the kernel, but
it would be even better if the fs made some effort to store the entries
in a good order so no sorting is needed at all.
* Re: Large directories and poor order correlation 2011-03-15 19:08 ` Phillip Susi @ 2011-03-16 1:50 ` Ted Ts'o 0 siblings, 0 replies; 19+ messages in thread From: Ted Ts'o @ 2011-03-16 1:50 UTC (permalink / raw) To: Phillip Susi; +Cc: Eric Sandeen, linux-ext4@vger.kernel.org On Tue, Mar 15, 2011 at 03:08:53PM -0400, Phillip Susi wrote: > > When you split the htree node, aren't you just moving around the > "deleted entries"? So the normal names remain in the same order so > readdir() doesn't have a problem when it is ignoring the htree entries > and just walking the normal names? No, because the leaf entries of the htree are the actual directory entries in the directory block. > Why was the htree hidden inside the normal directory structure anyway? So that we had full backwards compatibility with ext2. I had planned this feature in advance, and ext2 (and pre-htree ext3 implementations) would clear the htree flag if they tried to modify an htree directory. Since the leaf nodes (which are the ones that get split) are normal directory blocks, and interior nodes of the tree are directory blocks that have what appears to ext2 to be a deleted directory entry covering the entire block, it's fully backwards compatible with ext2/older ext3 systems. > > Unless some files get deleted in between. Now depending on the > > "holes" in the directory blocks, where the new directory entries are > > added, even in the non-htree case, could either be wherever an empty > > directory entry could be found, or in the worst case, we might need to > > allocate a new block and that new directory entry gets added to the > > end of the block. > > Right, but on an otherwise idle system, when you make all the files at > once via rsync or untaring an archive, this shouldn't happen and they > should be ( generally ) in ascending order, shouldn't they? If the directory is freshly created, yes. 
But over time, if you are adding and deleting files, such as a Maildir directory, this will not be the case forever. > > I suggest that you try some experiments, using both dir_index and > > non-dir_index file systems, and then looking at the directory using > > the debugfs "ls" and "htree_dump" commands. Either believe me, or > > learn how things really work. :-) > > Now THAT sounds interesting. Is this documented somewhere? The debugfs man page has some documentation. > Also, why can't chattr set/clear the 'I' flag? Is it just a runtime > combersome thing? So setting and clearing the flag with debugfs > followed by a fsck should do the trick? And when is it automatically > enabled? You can't clear the 'I' flag because I didn't want to bother getting the locking right so that it would be safe. And turning off the 'I' flag isn't going to help, since the directory entries are scrambled anyway --- again, because the leaf nodes of the htree *are* normal directory blocks. Turning the 'I' flag on would require reindexing the whole directory, which would be a major headache. You'd have to lock out all directory accesses, and then do a full copy operation --- and remember, a directory could be potentially many megabytes, and kernel memory is non-swappable. > I think you are right in that if sorting is to be done at > opendir()/readdir() time, then it should be done in libc, not the > kernel, but it would be even better if the fs made some effort store the > entries in a good order so no sorting is needed at all. The problem is that we have to store them in hash tree order so the lookups happen correctly. We could have done what JFS does, which is to keep three separate b-trees for its directories, and where every single time you add or modify a file, you have to update multiple btrees. 
But, (a) this would have required a non-backwards-compatible file system format change from ext2/older ext3 file systems, and (b) the extra b-tree updates are their own source of overhead. - Ted ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large directories and poor order correlation 2011-03-14 20:37 ` Eric Sandeen 2011-03-14 20:52 ` Phillip Susi @ 2011-03-15 7:59 ` Florian Weimer 2011-03-15 11:06 ` Theodore Tso 1 sibling, 1 reply; 19+ messages in thread From: Florian Weimer @ 2011-03-15 7:59 UTC (permalink / raw) To: Eric Sandeen; +Cc: Phillip Susi, linux-ext4@vger.kernel.org * Eric Sandeen: > No, because htree (dir_index) dirs returns names in hash-value > order, not inode number order. i.e. "at random." > > As you say, sorting by inode number will work much better... The dpkg folks tested this and it turns out that you get better results if you open the file and use FIBMAP to get the first block number, and sort by that. You could sort by inode number before the open/fstat calls, but it does not seem to help much. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 19+ messages in thread
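The FIBMAP-then-sort approach Florian describes can be sketched in Python (a sketch, not dpkg's actual code; the helper names are made up for illustration, and since FIBMAP needs CAP_SYS_RAWIO, the sketch falls back to the inode number when the ioctl is refused):

```python
import fcntl
import os
import struct

FIBMAP = 1  # _IO(0x00, 1) from <linux/fs.h>

def first_block(path):
    """Return the first physical block of a file via the FIBMAP ioctl.

    FIBMAP needs CAP_SYS_RAWIO; without it (or on filesystems that do
    not support the ioctl) fall back to the inode number, so the sort
    degrades to the plain "sort by inode" baseline Florian mentions.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # Pass logical block 0 in; the kernel overwrites it with
            # the physical block number.
            res = fcntl.ioctl(fd, FIBMAP, struct.pack('i', 0))
            return struct.unpack('i', res)[0]
        except OSError:
            return os.fstat(fd).st_ino
    finally:
        os.close(fd)

def sorted_by_block(directory):
    """File names in `directory`, ordered by first data block."""
    names = os.listdir(directory)
    return sorted(names, key=lambda n: first_block(os.path.join(directory, n)))
```

Reading files in the order `sorted_by_block` returns approximates a single elevator pass over the data blocks instead of seeking in hash order.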
* Re: Large directories and poor order correlation 2011-03-15 7:59 ` Florian Weimer @ 2011-03-15 11:06 ` Theodore Tso 2011-03-15 11:23 ` Ric Wheeler 2011-03-15 13:33 ` Rogier Wolff 0 siblings, 2 replies; 19+ messages in thread From: Theodore Tso @ 2011-03-15 11:06 UTC (permalink / raw) To: Florian Weimer; +Cc: Eric Sandeen, Phillip Susi, linux-ext4@vger.kernel.org On Mar 15, 2011, at 3:59 AM, Florian Weimer wrote: > * Eric Sandeen: > >> No, because htree (dir_index) dirs returns names in hash-value >> order, not inode number order. i.e. "at random." >> >> As you say, sorting by inode number will work much better... > > The dpkg folks tested this and it turns out that you get better > results if you open the file and use FIBMAP to get the first block > number, and sort by that. You could sort by inode number before the > open/fstat calls, but it does not seem to help much. It depends on which problem you are trying to solve. If this is a cold cache situation, and the inode cache is empty, then sorting by inode number will help since otherwise you'll be seeking all over just to read in the inode structures. This is true for any kind of readdir+stat combination, whether it's ls -l, or du or readdir + FIBMAP (I'd recommend using FIEMAP these days, though). However, if you need to suck in the information for a large number of small files (such as all of the files in /var/lib/dpkg/info), then sure, sorting on the block number can help reduce seeks on the data blocks side of things. So in an absolutely cold cache situation, what I'd recommend is readdir, sort by inode, FIEMAP, sort by block, and then read in the dpkg files. Of course an RPM partisan might say, "it would help if you guys had used a real database instead of ab(using) the file system." And then the dpkg guys could complain about what happens when RPM has to deal with a corrupted rpm database, and how this allows dpkg to use shell scripts to access their package information. Life is full of tradeoffs. 
-- Ted ^ permalink raw reply [flat|nested] 19+ messages in thread
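Ted's cold-cache recipe --- readdir, sort by inode, map, sort by block, then read --- can be sketched as follows (a simplification, not ext4 or dpkg code: `block_key` stands in for a FIEMAP-derived first-block lookup, and when it is absent st_ino is used instead, which is only reasonable when inode and block order are well correlated):

```python
import os

def cold_cache_read(directory, block_key=None):
    """Read every regular file under `directory` with minimal seeking.

    Pass 1: readdir already supplies d_ino, so sort by inode number
    before the stat() calls -- the inode table is then read in order.
    Pass 2: sort by first data block before reading the contents.
    `block_key` maps a path to its first block (e.g. via FIEMAP);
    if None, st_ino is used as a stand-in for the block number.
    """
    with os.scandir(directory) as it:
        entries = sorted(it, key=lambda e: e.inode())      # pass 1: inode order
    paths = [e.path for e in entries if e.is_file()]
    if block_key is None:
        block_key = lambda p: os.stat(p).st_ino
    data = {}
    for path in sorted(paths, key=block_key):              # pass 2: block order
        with open(path, 'rb') as f:
            data[path] = f.read()
    return data
```

Each pass turns a scatter of random seeks into something closer to a sequential sweep: first over the inode table, then over the data blocks.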
* Re: Large directories and poor order correlation 2011-03-15 11:06 ` Theodore Tso @ 2011-03-15 11:23 ` Ric Wheeler 2011-03-15 11:38 ` Theodore Tso 2011-03-15 13:33 ` Rogier Wolff 1 sibling, 1 reply; 19+ messages in thread From: Ric Wheeler @ 2011-03-15 11:23 UTC (permalink / raw) To: Theodore Tso Cc: Florian Weimer, Eric Sandeen, Phillip Susi, linux-ext4@vger.kernel.org On 03/15/2011 07:06 AM, Theodore Tso wrote: > On Mar 15, 2011, at 3:59 AM, Florian Weimer wrote: > >> * Eric Sandeen: >> >>> No, because htree (dir_index) dirs returns names in hash-value >>> order, not inode number order. i.e. "at random." >>> >>> As you say, sorting by inode number will work much better... >> The dpkg folks tested this and it turns out that you get better >> results if you open the file and use FIBMAP to get the first block >> number, and sort by that. You could sort by inode number before the >> open/fstat calls, but it does not seem to help much. > It depends on which problem you are trying to solve. If this is a cold > cache situation, and the inode cache is empty, then sorting by inode > number will help since otherwise you'll be seeking all over just to > read in the inode structures. This is true for any kind of readdir+stat > combination, whether it's ls -l, or du or readdir + FIBMAP (I'd > recommend using FIEMAP these days, though). > > However, if you need to suck in the information for a large number of > small files (such as all of the files in /var/lib/dpkg/info), then sure, sorting > on the block number can help reduce seeks on the data blocks side of > things. > > So in an absolutely cold cache situation, what I'd recommend is readdir, > sort by inode, FIEMAP, sort by block, and then read in the dpkg files. > Of course an RPM partisan might say, "it would help if you guys had > used a real database instead of ab(using) the file system." 
And then > the dpkg guys could complain about what happens when RPM has to > deal with a corrupted rpm database, and how this allows dpkg to use > shell scripts to access their package information. Life is full of tradeoffs. > > -- Ted > I have tested both sorting techniques with very large directories. Most of the gain came from the simple sort by inode number, but of course this relies on the file system allocation policy having a correlation between the inode numbers and layout (i.e., higher inode numbers correspond to higher block numbers). Note that you can get the inode number used in this sorting without doing any stat calls. Sorting by first block number also works well, but does have that extra syscall (probably two - open & fibmap?) per file. Ric ^ permalink raw reply [flat|nested] 19+ messages in thread
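Ric's point that the inode number is available without any stat calls holds because readdir() hands back d_ino in every dirent; in Python, for example, os.scandir() exposes it as DirEntry.inode() straight from the directory block (illustrative sketch, not from the thread):

```python
import os

def names_in_inode_order(directory):
    """Directory entries sorted by inode number, using only readdir().

    On Unix, DirEntry.inode() comes from the d_ino field of the dirent,
    so no per-file stat() -- and hence no inode-table seek -- is needed
    just to establish this ordering.
    """
    with os.scandir(directory) as it:
        return [e.name for e in sorted(it, key=lambda e: e.inode())]
```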
* Re: Large directories and poor order correlation 2011-03-15 11:23 ` Ric Wheeler @ 2011-03-15 11:38 ` Theodore Tso 0 siblings, 0 replies; 19+ messages in thread From: Theodore Tso @ 2011-03-15 11:38 UTC (permalink / raw) To: Ric Wheeler Cc: Florian Weimer, Eric Sandeen, Phillip Susi, linux-ext4@vger.kernel.org On Mar 15, 2011, at 7:23 AM, Ric Wheeler wrote: > > I have tested both sorting techniques with very large directories. > > Most of the gain came with the simple sorting by inode number, but of course this relies on the file system allocation policy having a correlation between the inode numbers and layout (i.e., higher inode number correspond to higher block numbers). That's not just a matter of file system allocation policy, but how fragmented the free space is as far as inode numbers and block numbers might be. (Especially in the block group where /var/lib/dpkg/info lives, on Debian systems. That's because that directory is (ab)used as a database, with lots of files added and deleted, so over time it's pretty much inevitable that any link between directory order, inode numbers, and block numbers, would become weaker and weaker over time. Fortunately the overall database tends to fit into the dentry, inode, and page caches after the first dpkg operation...) > Note that you can get the inode number used in this sorting without doing any stat calls. Sure. And in the worst case, you would sort based on the inode number so that when you call open/FI[BE]MAP, the disk isn't seeking all over the place when you read in the inode structure... Then you need to sort by block numbers, and assuming that you have enough memory that all of /var/lib/dpkg/info/* fits in the inode cache, you will minimize seeking while you read the blocks into memory. -- Ted ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large directories and poor order correlation 2011-03-15 11:06 ` Theodore Tso 2011-03-15 11:23 ` Ric Wheeler @ 2011-03-15 13:33 ` Rogier Wolff 2011-03-15 17:18 ` Ted Ts'o 1 sibling, 1 reply; 19+ messages in thread From: Rogier Wolff @ 2011-03-15 13:33 UTC (permalink / raw) To: Theodore Tso Cc: Florian Weimer, Eric Sandeen, Phillip Susi, linux-ext4@vger.kernel.org On Tue, Mar 15, 2011 at 07:06:34AM -0400, Theodore Tso wrote: > So in an absolutely cold cache situation, what I'd recommend is > readdir, sort by inode, FIEMAP, sort by block, and then read in the > dpkg files. Of course an RPM partisan might say, "it would help if > you guys had used a real database instead of ab(using) the file > system." And then the dpkg guys could complain about what happens > when RPM has to deal with a corrupted rpm database, and how this > allows dpkg to use shell scripts to access their package > information. Life is full of tradeoffs. IMHO, the most important part is "up to and including the stat". It should be possible to get the directory and inode info all inside the same "16Mb" part of the disk. This would result in (after a few seeks) the rest of the accesses coming from the disk's cache. This would mean that you should allocate directory blocks from the end of the PREVIOUS block group.... Roger. -- ** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 ** ** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 ** *-- BitWizard writes Linux device drivers for any device you may have! --* Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. Does it sit on the couch all day? Is it unemployed? Please be specific! Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large directories and poor order correlation 2011-03-15 13:33 ` Rogier Wolff @ 2011-03-15 17:18 ` Ted Ts'o 0 siblings, 0 replies; 19+ messages in thread From: Ted Ts'o @ 2011-03-15 17:18 UTC (permalink / raw) To: Rogier Wolff Cc: Florian Weimer, Eric Sandeen, Phillip Susi, linux-ext4@vger.kernel.org On Tue, Mar 15, 2011 at 02:33:27PM +0100, Rogier Wolff wrote: > IMHO, the most important part is "up to and including the stat". It > should be possible to get the directory, and inode info all inside the > same "16Mb" part of the disk. This would result in (after a few seeks) > the rest of the accesses coming from the disk's cache. It depends on your workload. In the case of dpkg, everything fits in cache (usually) so after the first operation this is no longer a concern. But the total size of the data blocks of /var/lib/dpkg/info/* is huge, since not using a real database means that a 4k block is consumed for 300 bytes of data, so fitting all of the data blocks in memory generally doesn't work, which is why the dpkg folks are sorting by block number. > This would mean that you should allocate directory blocks from the end > of the PREVIOUS block group.... We do something else, which is to group all directory blocks together at the beginning of each flex_bg. This tends to reduce free space fragmentation, and it helps to optimize for large files that are bigger than a block group, and where you want to have contiguous regions larger than a bg --- so breaking up the space every bg is not a great idea. Again, with general purpose file systems you can't just optimize for one thing, and life is full of tradeoffs. - Ted ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2011-03-16 1:50 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2011-03-14 20:24 Large directories and poor order correlation Phillip Susi 2011-03-14 20:37 ` Eric Sandeen 2011-03-14 20:52 ` Phillip Susi 2011-03-14 21:12 ` Eric Sandeen 2011-03-14 21:52 ` Ted Ts'o 2011-03-14 23:43 ` Phillip Susi 2011-03-15 0:14 ` Ted Ts'o 2011-03-15 14:01 ` Phillip Susi 2011-03-15 14:33 ` Rogier Wolff 2011-03-15 14:36 ` Ric Wheeler 2011-03-15 17:08 ` Ted Ts'o 2011-03-15 19:08 ` Phillip Susi 2011-03-16 1:50 ` Ted Ts'o 2011-03-15 7:59 ` Florian Weimer 2011-03-15 11:06 ` Theodore Tso 2011-03-15 11:23 ` Ric Wheeler 2011-03-15 11:38 ` Theodore Tso 2011-03-15 13:33 ` Rogier Wolff 2011-03-15 17:18 ` Ted Ts'o