From: Junio C Hamano <gitster@pobox.com>
To: Yannick Gingras <ygingras@ygingras.net>
Cc: git@vger.kernel.org
Subject: Re: On the many files problem
Date: Sat, 29 Dec 2007 11:27:31 -0800
Message-ID: <7vabntb8t8.fsf@gitster.siamese.dyndns.org>
In-Reply-To: <87y7bdweca.fsf@enceladus.ygingras.net> (Yannick Gingras's message of "Sat, 29 Dec 2007 13:22:29 -0500")
Yannick Gingras <ygingras@ygingras.net> writes:
> Greetings Git hackers,
>
> No doubt, you guys must have discussed this problem before but I will
> pretend that I can't find the relevant threads in the archive because
> Marc's search is kind of crude.
>
> I'm coding an application that will potentially store quite a bunch of
> files in the same directory so I wondered how I should do it. I tried
> a few different files systems and I tried path hashing, that is,
> storing the file that hashes to d3b07384d113 in d/d3/d3b07384d113. As
> far as I can tell, that's what Git does. It turned out to be slower
> than anything except ext3 without dir_index.
We hash like d3/b07384d113, but your understanding of what we do is
more or less right.
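
To make the difference between the two layouts concrete, here is a minimal sketch in Python. The function name `loose_object_path` is hypothetical and is not anything from Git's source; it only illustrates the split mentioned above, where the first two hex digits name the directory and the remaining digits name the file.

```python
def loose_object_path(hex_digest: str) -> str:
    """Map a hex digest to a '<2 hex digits>/<remaining digits>' path.

    Illustrative helper only (not Git code): the leading two characters
    become the directory and are NOT repeated in the file name.
    """
    if len(hex_digest) < 3:
        raise ValueError("digest too short to split into directory and file")
    return f"{hex_digest[:2]}/{hex_digest[2:]}"


if __name__ == "__main__":
    # The abbreviated digest from the quoted message.
    print(loose_object_path("d3b07384d113"))  # d3/b07384d113
```

Compared with the d/d3/d3b07384d113 scheme in the quoted text, this uses a single directory level and does not duplicate the prefix in the file name.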
If we had never introduced packed object storage, this issue might
have mattered, and we might have looked into it further to
improve loose object access performance. But in reality, no
sane git user would keep millions of loose objects unpacked.
And changing the layout would be a backward-incompatible
change for dumb transport clients. There is practically no
upside, and there are downsides, to changing it now.
Traditionally, avoiding large directories when dealing with a
large number of files by path hashing was tried and proven
wisdom in many applications (e.g. web proxies, news servers).
Newer filesystems do have tricks to let you quickly access a
large number of files in a single directory, and that lessens
the need for the applications to play path hashing games.
That is a good thing, but if that trick makes the traditional
way of dealing with a large number of files _too costly_, it may
be striking the balance at a wrong point. That is, it favors
newly written applications that assume that large directories
are OK (or ones written by people who do not know the historical
behaviour of filesystems) by punishing existing practices too
heavily.
The person who is guilty of introducing the hashed loose object
store is intimately familiar with Linux. I do not speak for
him, but if I had to guess, he originally chose path hashing
because he simply followed the tradition and did not want to
make the system too dependent on Linux or on a particular
feature of the underlying filesystems.