From: Nicolas Pitre <nico@cam.org>
To: Jaap Suter <git@jaapsuter.com>
Cc: git@vger.kernel.org
Subject: Re: Storing Large Flat Namespaces
Date: Mon, 08 Sep 2008 09:58:43 -0400 (EDT)
Message-ID: <alpine.LFD.1.10.0809080923250.23787@xanadu.home>
In-Reply-To: <21d738060809071858p703149ccmbec0276ad4ad8f88@mail.gmail.com>
On Sun, 7 Sep 2008, Jaap Suter wrote:
> Hello,
>
> I'm investigating the possibility of using Git to store a large flat
> namespace. As an example, imagine a single directory containing
> thousands or millions of files, each named using a 16-byte guid,
> evenly distributed.
>
> I'm aware that the Git object model makes various trade-offs,
> typically in favor of managing source-tree layouts - which it does
> extremely well. However, perhaps it is possible to carry some of Git's
> features as a content revision tracker over to other storage
> applications?
>
> Currently, tree objects for large flat directories are quite large.
> Doing a git-init, git-add, git-push on a flat directory with 10,000
> files creates a tree object that is 24 kilobytes compressed. Any
> change to a single file would create a whole new tree object, 24
> kilobytes every time.

Sure. But as soon as you repack that repository, there will be only one
24-kilobyte tree object for the latest revision, and older revisions will
have their tree objects stored as deltas against the latest one. Each
such delta should be roughly the size of the changed entry.
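
To make the repacking argument concrete, here is a minimal Python sketch.
It is an illustration under stated assumptions, not Git's delta code: the
file names and blob contents are made up, the tree is built uncompressed
and without its object header, and naive_delta_size() is a hypothetical
stand-in that only counts the bytes outside the common prefix and suffix
of two tree versions.

```python
import hashlib

def tree_entry(name, blob_sha):
    # Git tree entry layout: "<mode> <name>\0" followed by the raw 20-byte SHA-1.
    return b"100644 " + name.encode() + b"\0" + blob_sha

def build_tree(blobs):
    # Entries are stored sorted by name; this ignores the object header.
    return b"".join(tree_entry(n, blobs[n]) for n in sorted(blobs))

def naive_delta_size(old, new):
    # Hypothetical stand-in for a real delta: count the bytes of `new`
    # that fall outside the prefix and suffix it shares with `old`.
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return len(new) - p - s

# 10,000 files named by 32 hex characters (made-up names and contents).
blobs = {
    hashlib.md5(str(i).encode()).hexdigest(): hashlib.sha1(b"v1 %d" % i).digest()
    for i in range(10_000)
}
old_tree = build_tree(blobs)

# Change the content of a single file somewhere in the middle.
name = sorted(blobs)[5000]
blobs[name] = hashlib.sha1(b"v2").digest()
new_tree = build_tree(blobs)

changed = naive_delta_size(old_tree, new_tree)
print("tree size:", len(new_tree), "bytes; differing bytes:", changed)
```

Only the 20-byte SHA-1 of the modified entry differs between the two
versions, which is why a delta between them can stay tiny no matter how
large the tree itself is.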

As to the idea of splitting a large flat directory namespace into
sub-trees... that would only help in the commit case, which isn't that
interesting since commits are done much less frequently than, say,
directory walking. In the latter case, you might end up inflating less
object data for files at the beginning of the directory but more for
files towards the end, which on average won't be a gain. And in all
cases you're looking up more objects. And, to benefit from whatever
advantage is left, if any, you'd need to do significant surgery in the
tree walking code; otherwise, simply reconstructing the current object
format from subtree objects won't bring any advantage over the current
packed format using deltas.
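
The trade-off above can be sketched with some rough arithmetic in Python.
Everything here is an assumption made for illustration: the names are
made up, the fan-out on the first two hex characters is one hypothetical
way to split the namespace, and sizes are uncompressed tree payloads that
ignore object headers and any delta compression after a repack.

```python
import hashlib
from collections import defaultdict

# Per-entry bytes besides the name: mode plus space, a NUL, and a raw SHA-1.
FILE_OVERHEAD = len(b"100644 ") + 1 + 20
SUBTREE_OVERHEAD = len(b"40000 ") + 1 + 20

def tree_size(names, overhead):
    # Uncompressed payload size of a tree holding the given entry names.
    return sum(len(n) + overhead for n in names)

# 10,000 made-up file names of 32 hex characters each.
names = [hashlib.md5(str(i).encode()).hexdigest() for i in range(10_000)]

# Flat layout: changing one file rewrites the single big tree.
flat_rewritten = tree_size(names, FILE_OVERHEAD)

# Hypothetical fan-out: bucket files by the first two hex characters.
buckets = defaultdict(list)
for n in names:
    buckets[n[:2]].append(n[2:])

# Changing one file rewrites the root tree plus the one affected subtree.
changed = names[0]
root_size = tree_size(buckets.keys(), SUBTREE_OVERHEAD)
leaf_size = tree_size(buckets[changed[:2]], FILE_OVERHEAD)
sharded_rewritten = root_size + leaf_size

# A full directory walk has to look up every tree object.
flat_walk_objects = 1
sharded_walk_objects = 1 + len(buckets)

print("bytes rewritten per commit:", flat_rewritten, "vs", sharded_rewritten)
print("trees looked up per full walk:", flat_walk_objects, "vs", sharded_walk_objects)
```

The split layout writes far fewer bytes per commit before any repack, but
a full walk has to look up many more objects, and once the flat trees are
stored as deltas the per-commit saving largely disappears as well.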

Pack v4 should in theory be much better with large directories, since
the design of its tree object representation would allow walking partial
trees by recursively following deltas, making a complete tree walk almost
linear in terms of processing and data touched. But that too requires
major surgery of the tree walking code (the main reason holding me back
from rushing into any pack v4 work at the moment). And pack v4 won't
change anything in the commit case -- you would still have to repack in
order to benefit from it.
Nicolas