* On the many files problem
From: Yannick Gingras @ 2007-12-29 18:22 UTC
  To: git


Greetings Git hackers,

No doubt, you guys must have discussed this problem before but I will
pretend that I can't find the relevant threads in the archive because
Marc's search is kind of crude.

I'm coding an application that will potentially store quite a bunch of
files in the same directory, so I wondered how I should do it.  I tried
a few different file systems and I tried path hashing, that is,
storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As
far as I can tell, that's what Git does.  It turned out to be slower
than anything except ext3 without dir_index.  You can see my results
and the benchmarking code that I used here:

  http://ygingras.net/b/2007/12/too-many-files:-reiser-fs-vs-hashed-paths
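
A minimal sketch of what I mean by path hashing (the helper below is
just for illustration, not the code I benchmarked, and the MD5 prefix
is an assumption on my part):

    import hashlib
    import os

    def hashed_path(root, name):
        # e.g. a file whose name hashes to d3b07384d113 is stored
        # under root/d/d3/d3b07384d113
        digest = hashlib.md5(name.encode()).hexdigest()[:12]
        return os.path.join(root, digest[0], digest[:2], digest)

    def store(root, name, data):
        path = hashed_path(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)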

Offhand, I would be tempted to say that hashing paths always makes
things slower, but the Git development team includes people with
really intimate knowledge of several file system implementations, so
I suspect that you guys know something that I don't.

Can you describe how you hash the paths and what trick is done to
ensure fast creation of and access to the subdirectories?  Is path
hashing generally faster, or are you trying to avoid problems for
people using git on baroque file systems?

Best regards, 

-- 
Yannick Gingras


* Re: On the many files problem
From: Linus Torvalds @ 2007-12-29 19:12 UTC
  To: Yannick Gingras; +Cc: git



On Sat, 29 Dec 2007, Yannick Gingras wrote:
> 
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory, so I wondered how I should do it.  I tried
> a few different file systems and I tried path hashing, that is,
> storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As
> far as I can tell, that's what Git does.  It turned out to be slower
> than anything except ext3 without dir_index.

Note that git doesn't hash paths because it *wants* to, but because it
*needs* to.

Basically, what git very fundamentally wants as the basic operation is
to look up a SHA1-hashed object.

> Offhand, I would be tempted to say that hashing paths always makes
> things slower, but the Git development team includes people with
> really intimate knowledge of several file system implementations, so
> I suspect that you guys know something that I don't.

No, no. The fastest way to store a bunch of files is to basically:

 - make the filenames as small as possible

   This improves density, and thus performance. You'll get more files to 
   fit in a smaller directory, and filename compares etc will be faster 
   too.

   If you don't need hashes, but can do with smaller names (for example,
   your names are really just sequential, and you can use some base-64
   encoding to make them smaller than numbers), you'll always be better
   off (a base-64 sketch follows this list).

 - store them in a good final order. This is debatable, but depending on 
   your load and the filesystem, it can be better to make sure that you 
   create all files in order, because performance can plummet if you start 
   removing files later.

   I suspect your benchmark *only* tested this case, but if you want to 
   check odder cases, try creating a huge directory, and then deleting 
   most files, and then adding a few new ones. Some filesystems will take 
   a huge hit because they'll still scan the whole directory, even though 
   it's mostly empty!

   (Also, a "readdir() + stat()" loop will often get *much* worse access 
   patterns if you've mixed deletions and creations)

 - it's generally *much* more efficient to have one large file that you 
   read and seek in than having many small ones, so if you can change your 
   load so that you don't have tons of files at all, you'll probably be 
   better off.
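
As a rough illustration of the base-64 point above (just a sketch; the
alphabet and helper name here are made up, not anything git uses):

    import string

    # Hypothetical filename-safe 64-character alphabet.
    ALPHABET = (string.digits + string.ascii_uppercase
                + string.ascii_lowercase + "-_")

    def short_name(seq):
        # Encode a sequential integer in base 64; a 9-digit decimal
        # sequence number fits in 5 characters this way.
        if seq == 0:
            return ALPHABET[0]
        out = []
        while seq:
            seq, rem = divmod(seq, 64)
            out.append(ALPHABET[rem])
        return "".join(reversed(out))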

Anyway, git is not a good thing to look at for "lots of files" performance 
if you have any options. The reasons git does what it does are:

 - as mentioned, the SHA1 hash is the fundamental unit of lookup, so the 
   name *must* be the hash, since that's all we ever have!

 - If you don't have indexing, you really *really* don't want to have a 
   flat file structure (as your benchmark found out). And git didn't want 
   to assume you have indexing, since a lot of filesystems don't (I 
   suspect a lot of ext3 installations _still_ don't enable it, for 
   example, even if the capability is there!)

So the first point explains the use of hashes for filenames, and the 
second point explains the use of a single level of indexing. But neither 
of them is about absolute performance per se; both are about basic 
requirements.

For good performance, look at the git pack-files. Those are designed to be
data-dense, and to be only a few files that can be opened and mapped just once.
There, performance was one of the guiding goals (along with the ability to 
also use them for streaming, and still being *reasonably* simple, of 
course).
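
As a minimal sketch of the "one large file you read and seek in" idea
(this is just an illustration, not the pack format; the class and its
names are made up): append records to a single data file and keep an
index of offsets, so a read is one seek plus one read.

    import os

    class AppendStore:
        # Toy single-file store: append each record to one big file and
        # remember (offset, length) per key, so lookups need no
        # per-record directory entries or inodes.
        def __init__(self, path):
            self.f = open(path, "a+b")
            self.index = {}

        def put(self, key, data):
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(data)
            self.index[key] = (offset, len(data))

        def get(self, key):
            offset, length = self.index[key]
            self.f.seek(offset)
            return self.f.read(length)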

			Linus


* Re: On the many files problem
From: Junio C Hamano @ 2007-12-29 19:27 UTC
  To: Yannick Gingras; +Cc: git

Yannick Gingras <ygingras@ygingras.net> writes:

> Greetings Git hackers,
>
> No doubt, you guys must have discussed this problem before but I will
> pretend that I can't find the relevant threads in the archive because
> Marc's search is kind of crude.
>
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory, so I wondered how I should do it.  I tried
> a few different file systems and I tried path hashing, that is,
> storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113.  As
> far as I can tell, that's what Git does.  It turned out to be slower
> than anything except ext3 without dir_index.

We hash like d3/b07384d113, but your understanding of what we do is
more or less right.
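
Concretely (a sketch of the layout, not git's code; the helper name
is made up):

    import os

    def loose_object_path(git_dir, sha1_hex):
        # The first two hex digits pick the fan-out directory, the
        # remaining 38 name the file, e.g.
        # .git/objects/4f/a7491b073032b57c7fcf28c9222a5fa7b3a6b9
        return os.path.join(git_dir, "objects",
                            sha1_hex[:2], sha1_hex[2:])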

If we had never introduced packed object storage, this issue
might have mattered and we might have looked into it further
to improve loose object access performance.  But in reality,
no sane git user would keep millions of loose objects
unpacked, and changing the layout would mean a backward
incompatible change for dumb transport clients.  There is
practically no upside, and there are downsides, to changing
it now.

Traditionally, using path hashing to avoid large directories
when dealing with a large number of files was tried-and-proven
wisdom in many applications (e.g. web proxies, news servers).
Newer filesystems do have tricks to let you quickly access a
large number of files in a single directory, and that lessens
the need for applications to play path hashing games.

That is a good thing, but if that trick makes the traditional
way of dealing with a large number of files _too costly_, it
may be striking the balance at the wrong point.  That would
favor newly written applications that assume large directories
are OK (or ones written by people who do not know the
historical behaviour of filesystems) while punishing existing
practices too heavily.

The person who is guilty of introducing the hashed loose
object store is intimately familiar with Linux.  I do not
speak for him, but if I have to guess, he originally chose
path hashing because he just followed the tradition, and he
did not want to make the system too dependent on Linux or on
a particular feature of the underlying filesystems.


* Re: On the many files problem
From: Yannick Gingras @ 2007-12-31 10:13 UTC
  To: Linus Torvalds; +Cc: git


Thanks to you and to Junio for your replies, 

Linus Torvalds <torvalds@linux-foundation.org> writes:
> No, no. The fastest way to store a bunch of files is to basically:
>
>  - make the filenames as small as possible
>
>    This improves density, and thus performance. You'll get more files to 
>    fit in a smaller directory, and filename compares etc will be faster 
>    too.
>
>    If you don't need hashes, but can do with smaller names (for example, 
>    your names are really just sequential, and you can use some base-64 
>    encoding to make them smaller than numbers), you'll always be better 
>    off.

This is really interesting and I would not have suspected it.  But it
raises the question: why does Git use base-16 filenames instead of
base-64?

After your replies I took a serious look at Git's storage and there
are indeed not that many loose objects in a typical repo: most objects
are kept in packs.  So I guess Git doesn't need that much density, and
keeping the filenames in the format that is used in the UI probably
helps power users.

>  - store them in a good final order. This is debatable, but depending on 
>    your load and the filesystem, it can be better to make sure that you 
>    create all files in order, because performance can plummet if you start 
>    removing files later.
>
>    I suspect your benchmark *only* tested this case, 

That is true.

>    but if you want to check odder cases, try creating a huge
>    directory, and then deleting most files, and then adding a few
>    new ones. Some filesystems will take a huge hit because they'll
>    still scan the whole directory, even though it's mostly empty!
>
>    (Also, a "readdir() + stat()" loop will often get *much* worse access 
>    patterns if you've mixed deletions and creations)

This is something that will be interesting to benchmark later on.  So,
an application with a lot of turnover, say a mail server, should
delete and re-create the directories from time to time?  I assume this
is specific to some file system types.

>  - it's generally *much* more efficient to have one large file that you 
>    read and seek in than having many small ones, so if you can change your 
>    load so that you don't have tons of files at all, you'll probably be 
>    better off. 

That makes a lot of sense.  

Thanks again for those clarifications.

-- 
Yannick Gingras


* Re: On the many files problem
From: Linus Torvalds @ 2007-12-31 20:45 UTC
  To: Yannick Gingras; +Cc: git



On Mon, 31 Dec 2007, Yannick Gingras wrote:
> 
> This is really interesting and I would not have suspected it.  But it
> raises the question: why does Git use base-16 filenames instead of
> base-64?

Because I was stupid, and I'm a lot more used to hex numbers than to
base-64.

Also, the way the original encoding worked, the SHA1 of the object was
actually the SHA1 of the *compressed* object, so you could verify the
integrity of a written object by just doing a

	sha1sum .git/objects/4f/a7491b073032b57c7fcf28c9222a5fa7b3a6b9

and it would return 4fa7491b073032b57c7fcf28c9222a5fa7b3a6b9 if everything 
was good. That was a nice bonus in the first few days of git development, 
when it all was a set of very low-level object routines hung together with 
prayers and duct-tape.
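
Under that original scheme the check amounted to something like this
(a sketch of the historical behaviour only; today git hashes the
uncompressed header plus contents, so this no longer holds, and the
helper name is made up):

    import hashlib
    import os

    def verify_loose_object(path):
        # The filename was the SHA1 of the on-disk (compressed) bytes,
        # so hashing the file as-is should reproduce the name formed
        # by the fan-out directory plus the filename.
        dirname, filename = os.path.split(path)
        expected = os.path.basename(dirname) + filename
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest() == expected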

So consider it historical. It wasn't worth fixing, since it became obvious 
that the real fix would never be to try to make the individual files or 
filenames smaller.

> >    (Also, a "readdir() + stat()" loop will often get *much* worse access 
> >    patterns if you've mixed deletions and creations)
> 
> This is something that will be interesting to benchmark later on.  So,
> an application with a lot of turnover, say a mail server, should
> delete and re-create the directories from time to time?  I assume this
> is specific to some file system types.

This is an issue only for certain filesystems, and it's also an issue only 
for certain access patterns.

A mail server, for example, will seldom *scan* the directory. It will just 
open individual files by name. So it won't be hit by the "readdir+stat" 
issue, unless you actually do a "ls -l".

(There are exceptions. Some mailbox formats use a file per email in a 
directory. And yes, they tend to suck from a performance angle).

And you can avoid it. For example, on most unixish filesystems, you can
improve the IO access pattern by doing the readdir() into an array,
sorting it by inode number, and then doing the stat() calls in that
order: that *often* (but not always - there's no guarantee what the
inode number actually means) gives you better disk access patterns.
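
A minimal sketch of that trick (Python used just for illustration, not
anything git does):

    import os

    def stat_in_inode_order(path):
        # Read the whole directory first, sort the entries by inode
        # number, and only then stat() them, so the inode table is
        # walked roughly in disk order rather than directory order.
        entries = sorted(os.scandir(path), key=lambda e: e.inode())
        return [(e.name, os.stat(e.path, follow_symlinks=False))
                for e in entries]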

			Linus


* Re: On the many files problem
From: Martin Langhoff @ 2007-12-31 23:31 UTC
  To: Yannick Gingras; +Cc: Linus Torvalds, git

On Dec 31, 2007 11:13 PM, Yannick Gingras <ygingras@ygingras.net> wrote:
> >    but if you want to check odder cases, try creating a huge
> >    directory, and then deleting most files, and then adding a few
> >    new ones. Some filesystems will take a huge hit because they'll
> >    still scan the whole directory, even though it's mostly empty!
> >
> >    (Also, a "readdir() + stat()" loop will often get *much* worse access
> >    patterns if you've mixed deletions and creations)
>
> This is something that will be interesting to benchmark later on.  So,
> an application with a lot of turnover, say a mail server, should
> delete and re-create the directories from time to time?  I assume this
> is specific to some file system types.

This is indeed the case. Directories with a lot of movement get
fragmented on most FSs -- ext3 is a very bad case for this -- and
there are no "directory defrag" tools other than regenerating them.
The "Maildir" storage used by many IMAP servers these days shows the
problem.
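
A sketch of what "regenerating" a directory can look like (the helper
below is my own illustration, not an existing tool; it assumes a flat
directory of regular files and is not atomic for readers that still
hold the old directory open):

    import os
    import shutil

    def regenerate_directory(path):
        # Copy the surviving entries into a freshly created directory,
        # then swap it into place, leaving a compact directory behind.
        tmp = path + ".rebuild"
        os.mkdir(tmp)
        for name in os.listdir(path):
            shutil.copy2(os.path.join(path, name), os.path.join(tmp, name))
        old = path + ".old"
        os.rename(path, old)
        os.rename(tmp, path)
        shutil.rmtree(old)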

This (longish) thread has some interesting tidbits on getdents() and
directory fragmentation:
http://kerneltrap.org/mailarchive/git/2007/1/7/235215

cheers,


m
