* On the many files problem

From: Yannick Gingras
Date: 2007-12-29 18:22 UTC
To: git

Greetings Git hackers,

No doubt, you guys must have discussed this problem before, but I will
pretend that I can't find the relevant threads in the archive because
Marc's search is kind of crude.

I'm coding an application that will potentially store quite a bunch of
files in the same directory, so I wondered how I should do it.  I tried
a few different file systems and I tried path hashing, that is, storing
the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As far as I
can tell, that's what Git does.  It turned out to be slower than
anything except ext3 without dir_index.  You can see my results and the
benchmarking code that I used here:

  http://ygingras.net/b/2007/12/too-many-files:-reiser-fs-vs-hashed-paths

Quick like that, I would be tempted to say that hashing paths always
makes things slower, but the Git development team includes people with
really intimate knowledge of several file system implementations, so
I'm tempted to say that you guys know something that I don't.

Can you describe how you hash the paths and what trick is done to
ensure fast creation of and access to the subdirectories?  Is path
hashing generally faster, or are you trying to avoid problems for
people using Git on baroque file systems?

Best regards,

-- 
Yannick Gingras
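For illustration, here is a minimal sketch of the hashed-path layout the
post describes (this is not the linked benchmark code; the helper name
and buffer size are assumptions made for this example):

#include <stdio.h>

/*
 * Sketch of the layout described above: a file whose name hashes to
 * "d3b07384d113" is stored under "d/d3/d3b07384d113", i.e. the first
 * hex character names a top-level directory, the first two name a
 * subdirectory, and the full name is kept as the filename.
 */
static void hashed_path(char *out, size_t outlen, const char *hexname)
{
	snprintf(out, outlen, "%.1s/%.2s/%s", hexname, hexname, hexname);
}

int main(void)
{
	char path[512];

	hashed_path(path, sizeof(path), "d3b07384d113");
	printf("%s\n", path);	/* prints: d/d3/d3b07384d113 */
	return 0;
}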
* Re: On the many files problem

From: Linus Torvalds
Date: 2007-12-29 19:12 UTC
To: Yannick Gingras; +Cc: git

On Sat, 29 Dec 2007, Yannick Gingras wrote:
>
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory, so I wondered how I should do it.  I tried
> a few different file systems and I tried path hashing, that is, storing
> the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As far as I
> can tell, that's what Git does.  It turned out to be slower than
> anything except ext3 without dir_index.

Note that git doesn't hash paths because it *wants* to, but because it
*needs* to.  Basically, what git very fundamentally wants as the basic
operation is to look up a SHA1-hashed object.

> Quick like that, I would be tempted to say that hashing paths always
> makes things slower, but the Git development team includes people with
> really intimate knowledge of several file system implementations, so
> I'm tempted to say that you guys know something that I don't.

No, no.  The fastest way to store a bunch of files is basically to:

 - Make the filenames as small as possible.

   This improves density, and thus performance.  You'll get more files
   to fit in a smaller directory, and filename compares etc. will be
   faster too.

   If you don't need hashes but can do with smaller names (for example,
   your names are really just sequential, and you can use some base-64
   encoding to make them smaller than decimal numbers), you'll always
   be better off.

 - Store them in a good final order.

   This is debatable, but depending on your load and the filesystem, it
   can be better to make sure that you create all files in order,
   because performance can plummet if you start removing files later.

   I suspect your benchmark *only* tested this case, but if you want to
   check odder cases, try creating a huge directory, then deleting most
   files, and then adding a few new ones.  Some filesystems will take a
   huge hit because they'll still scan the whole directory, even though
   it's mostly empty!

   (Also, a "readdir() + stat()" loop will often get *much* worse access
   patterns if you've mixed deletions and creations.)

 - It's generally *much* more efficient to have one large file that you
   read and seek in than to have many small ones, so if you can change
   your load so that you don't have tons of files at all, you'll
   probably be better off.

Anyway, git is not a good thing to look at for "lots of files"
performance if you have any options.  The reasons git does what it does
are:

 - As mentioned, the SHA1 hash is the fundamental unit of lookup, so
   the name *must* be the hash, since that's all we ever have!

 - If you don't have indexing, you really *really* don't want to have a
   flat file structure (as your benchmark found out).  And git didn't
   want to assume you have indexing, since a lot of filesystems don't
   (I suspect a lot of ext3 installations _still_ don't enable it, for
   example, even if the capability is there!)

So the first point explains the use of hashes for filenames, and the
second point explains the use of a single level of indexing.  But
neither of them is about absolute performance per se; both are about
basic requirements.

For good performance, look at the git pack files.  Those are designed
to be data-dense, and to be only a few files that can be opened and
mapped just once.  There, performance was one of the guiding goals
(along with the ability to also use them for streaming, and still being
*reasonably* simple, of course).

		Linus
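As an illustration of the "make the filenames as small as possible"
point, here is a minimal sketch that encodes a sequential id with 64
filename-safe characters so that even large counters yield very short
names.  The alphabet and helper name are assumptions made for this
example; git itself does not do this, since its filenames must be the
SHA1 hashes.

#include <stdio.h>

/*
 * Encode a sequential id in base 64 using filename-safe characters.
 * The digits come out least-significant first, which does not matter
 * for a name; all that matters is that the mapping is unique.
 */
static const char b64chars[] =
	"0123456789"
	"abcdefghijklmnopqrstuvwxyz"
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	"-_";

static void short_name(char *out, size_t outlen, unsigned long long id)
{
	size_t i = 0;

	do {
		if (i + 1 < outlen)
			out[i++] = b64chars[id & 63];
		id >>= 6;
	} while (id);
	out[i] = '\0';
}

int main(void)
{
	char name[16];

	/* A 5-character name instead of 9 decimal digits. */
	short_name(name, sizeof(name), 123456789ULL);
	printf("%s\n", name);
	return 0;
}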
* Re: On the many files problem

From: Yannick Gingras
Date: 2007-12-31 10:13 UTC
To: Linus Torvalds; +Cc: git

Thanks to you and to Junio for your replies.

Linus Torvalds <torvalds@linux-foundation.org> writes:

> No, no.  The fastest way to store a bunch of files is basically to:
>
>  - Make the filenames as small as possible.
>
>    This improves density, and thus performance.  You'll get more files
>    to fit in a smaller directory, and filename compares etc. will be
>    faster too.
>
>    If you don't need hashes but can do with smaller names (for example,
>    your names are really just sequential, and you can use some base-64
>    encoding to make them smaller than decimal numbers), you'll always
>    be better off.

This is really interesting and I would not have suspected it.  But it
raises the question: why does Git use the base-16 hash instead of a
base-64 hash?  After your replies I took a serious look at Git's
storage, and there are indeed not that many loose objects in a typical
repo: most objects are kept in packs.  So I guess Git doesn't need that
much density, and keeping the filename in the format that is used in
the UI probably helps power users.

>  - Store them in a good final order.
>
>    This is debatable, but depending on your load and the filesystem, it
>    can be better to make sure that you create all files in order,
>    because performance can plummet if you start removing files later.
>
>    I suspect your benchmark *only* tested this case,

That is true.

>    but if you want to check odder cases, try creating a huge directory,
>    then deleting most files, and then adding a few new ones.  Some
>    filesystems will take a huge hit because they'll still scan the
>    whole directory, even though it's mostly empty!
>
>    (Also, a "readdir() + stat()" loop will often get *much* worse access
>    patterns if you've mixed deletions and creations.)

This is something that will be interesting to benchmark later on.  So,
an application with a lot of turnaround, say a mail server, should
delete and re-create the directories from time to time?  I assume this
is specific to some file system types.

>  - It's generally *much* more efficient to have one large file that you
>    read and seek in than to have many small ones, so if you can change
>    your load so that you don't have tons of files at all, you'll
>    probably be better off.

That makes a lot of sense.

Thanks again for those clarifications.

-- 
Yannick Gingras
* Re: On the many files problem

From: Linus Torvalds
Date: 2007-12-31 20:45 UTC
To: Yannick Gingras; +Cc: git

On Mon, 31 Dec 2007, Yannick Gingras wrote:
>
> This is really interesting and I would not have suspected it.  But it
> raises the question: why does Git use the base-16 hash instead of a
> base-64 hash?

Because I was stupid, and I'm a lot more used to hex numbers than to
base-64.

Also, the way the original encoding worked, the SHA1 of the object was
actually the SHA1 of the *compressed* object, so you could verify the
integrity of the object writing by just doing a

	sha1sum .git/objects/4f/a7491b073032b57c7fcf28c9222a5fa7b3a6b9

and it would return

	4fa7491b073032b57c7fcf28c9222a5fa7b3a6b9

if everything was good.  That was a nice bonus in the first few days of
git development, when it all was a set of very low-level object
routines hung together with prayers and duct tape.

So consider it historical.  It wasn't worth fixing, since it became
obvious that the real fix would never be to try to make the individual
files or filenames smaller.

> > (Also, a "readdir() + stat()" loop will often get *much* worse access
> > patterns if you've mixed deletions and creations.)
>
> This is something that will be interesting to benchmark later on.  So,
> an application with a lot of turnaround, say a mail server, should
> delete and re-create the directories from time to time?  I assume this
> is specific to some file system types.

This is an issue only for certain filesystems, and it's also an issue
only for certain access patterns.

A mail server, for example, will seldom *scan* the directory.  It will
just open individual files by name.  So it won't be hit by the
"readdir() + stat()" issue, unless you actually do an "ls -l".

(There are exceptions.  Some mailbox formats use a file per email in a
directory.  And yes, they tend to suck from a performance angle.)

And you can avoid it.  For example, on most Unix-ish filesystems, you
can get better IO access patterns by doing the readdir() into an array,
then sorting it by inode number, and then doing the stat() calls in
that order: that *often* (but not always - there's no guarantee what
the inode number actually means) gives you better disk access patterns.

		Linus
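A minimal sketch of the readdir()-then-sort-by-inode-then-stat()
pattern described above (error handling trimmed; this is not git code):
it collects all directory entries first, sorts them by d_ino, and only
then issues the stat() calls, which on many filesystems turns the stat
pass into a mostly sequential walk of the inode table.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* One directory entry: the name plus the inode number readdir() gave us. */
struct entry {
	char name[256];
	ino_t ino;
};

static int by_inode(const void *a, const void *b)
{
	ino_t ia = ((const struct entry *)a)->ino;
	ino_t ib = ((const struct entry *)b)->ino;
	return ia < ib ? -1 : ia > ib;
}

int main(int argc, char **argv)
{
	const char *dirpath = argc > 1 ? argv[1] : ".";
	DIR *dir = opendir(dirpath);
	struct entry *entries = NULL;
	size_t n = 0, alloc = 0;
	struct dirent *de;

	if (!dir)
		return 1;

	/* Pass 1: read every name and its inode number; no stat() yet. */
	while ((de = readdir(dir)) != NULL) {
		if (n == alloc) {
			alloc = alloc ? 2 * alloc : 64;
			entries = realloc(entries, alloc * sizeof(*entries));
			if (!entries)
				return 1;
		}
		strncpy(entries[n].name, de->d_name, sizeof(entries[n].name) - 1);
		entries[n].name[sizeof(entries[n].name) - 1] = '\0';
		entries[n].ino = de->d_ino;
		n++;
	}
	closedir(dir);

	/* Pass 2: stat() in inode order, which tends to match on-disk layout. */
	qsort(entries, n, sizeof(*entries), by_inode);
	for (size_t i = 0; i < n; i++) {
		char path[4096];
		struct stat st;

		snprintf(path, sizeof(path), "%s/%s", dirpath, entries[i].name);
		if (stat(path, &st) == 0)
			printf("%s\t%lld bytes\n", entries[i].name,
			       (long long)st.st_size);
	}
	free(entries);
	return 0;
}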
* Re: On the many files problem

From: Martin Langhoff
Date: 2007-12-31 23:31 UTC
To: Yannick Gingras; +Cc: Linus Torvalds, git

On Dec 31, 2007 11:13 PM, Yannick Gingras <ygingras@ygingras.net> wrote:
> > but if you want to check odder cases, try creating a huge directory,
> > then deleting most files, and then adding a few new ones.  Some
> > filesystems will take a huge hit because they'll still scan the
> > whole directory, even though it's mostly empty!
> >
> > (Also, a "readdir() + stat()" loop will often get *much* worse access
> > patterns if you've mixed deletions and creations.)
>
> This is something that will be interesting to benchmark later on.  So,
> an application with a lot of turnaround, say a mail server, should
> delete and re-create the directories from time to time?  I assume this
> is specific to some file system types.

This is indeed the case.  Directories with a lot of movement get
fragmented on most filesystems -- ext3 is a very bad case for this --
and there are no "directory defrag" tools other than regenerating the
directories.  The "Maildir" storage used by many IMAP servers these
days shows the problem.

This (longish) thread has some interesting tidbits on getdents() and
directory fragmentation:

  http://kerneltrap.org/mailarchive/git/2007/1/7/235215

cheers,

m
* Re: On the many files problem

From: Junio C Hamano
Date: 2007-12-29 19:27 UTC
To: Yannick Gingras; +Cc: git

Yannick Gingras <ygingras@ygingras.net> writes:

> Greetings Git hackers,
>
> No doubt, you guys must have discussed this problem before, but I will
> pretend that I can't find the relevant threads in the archive because
> Marc's search is kind of crude.
>
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory, so I wondered how I should do it.  I tried
> a few different file systems and I tried path hashing, that is, storing
> the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As far as I
> can tell, that's what Git does.  It turned out to be slower than
> anything except ext3 without dir_index.

We hash like d3/b07384d113, but your understanding of what we do is
more or less right.

If we had never introduced packed object storage, this issue might have
mattered, and we might have looked into it further to improve loose
object access performance.  But in reality, no sane git user would keep
millions of loose objects unpacked, and changing the layout would mean
a backward-incompatible change for dumb-transport clients.  There is
practically no upside, and there are downsides, to changing it now.

Traditionally, avoiding large directories by path hashing when dealing
with a large number of files was a tried and proven wisdom in many
applications (e.g. web proxies, news servers).  Newer filesystems do
have tricks to let you quickly access a large number of files in a
single directory, and that lessens the need for applications to play
path-hashing games.  That is a good thing, but if that trick makes the
traditional way of dealing with a large number of files _too costly_,
it may be striking the balance at the wrong point: it favors newly
written applications that assume large directories are OK (or ones
written by people who do not know the historical behaviour of
filesystems) by punishing existing practices too heavily.

The person who is guilty of introducing the hashed loose object store
is intimately familiar with Linux.  I do not speak for him, but if I
had to guess, the reason he originally chose path hashing was that he
just followed the tradition, and he did not want to make the system too
dependent on Linux or on a particular feature of the underlying
filesystems.
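To make the correction in Junio's first sentence concrete, here is a
minimal sketch of the fan-out git actually uses for loose objects: the
first two hex characters of the object name become the directory, and
the remaining characters become the filename.  The helper name is made
up for this example, and git's own implementation differs in detail;
the object name shown is the one from Linus's sha1sum example above.

#include <stdio.h>

/*
 * Sketch of git's loose-object layout as described by Junio:
 * "d3b07384d113..." is stored under "d3/b07384d113...", with the
 * two-character prefix stripped from the filename.
 */
static void loose_object_path(char *out, size_t outlen, const char *hex)
{
	snprintf(out, outlen, ".git/objects/%.2s/%s", hex, hex + 2);
}

int main(void)
{
	char path[512];

	loose_object_path(path, sizeof(path),
			  "4fa7491b073032b57c7fcf28c9222a5fa7b3a6b9");
	/* prints: .git/objects/4f/a7491b073032b57c7fcf28c9222a5fa7b3a6b9 */
	printf("%s\n", path);
	return 0;
}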