Storing Large Flat Namespaces

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Storing Large Flat Namespaces
@ 2008-09-08  1:58 Jaap Suter
  2008-09-08  4:26 ` Jaap Suter
  2008-09-08 13:58 ` Nicolas Pitre
  0 siblings, 2 replies; 3+ messages in thread
From: Jaap Suter @ 2008-09-08  1:58 UTC (permalink / raw)
  To: git

Hello,

I'm investigating the possibility of using Git to store a large flat
namespace. As an example, imagine a single directory containing
thousands or millions of files, each named using a 16-byte guid,
evenly distributed.

I'm aware that the Git object model makes various trade-offs,
typically in favor of managing source-tree layouts - which it does
extremely well. However, perhaps it is possible to carry some of Git's
features as a content revision tracker over to other storage
applications?

Currently, tree objects for large flat directories are quite large.
Doing a git-init, git-add, git-push on a flat directory with 10,000
files creates a tree object that is 24 kilobytes compressed. Any
change to a single file would create a whole new tree object, 24
kilobytes every time.

The obvious thought is to "virtualize" or "tree-ify" the namespace,
similar to how Git stores its own loose objects. (Is there a proper
name for this technique?)

This archive thread touches on that idea:

   http://kerneltrap.org/mailarchive/git/2006/2/8/200585

In particular (quoting Linus):

   "Final note: this means, for example, that git is relatively bad at
tracking a "hashed" nested file directory (like the one git itself
uses), because new files will end up randomly appearing in every
directory, and no directory is ever "stable".

I wonder how bad it is, so I did some calculations. What if Git could
"virtualize" the namespace when going from the working directory to
the object store, without the user ever realizing it. In other words,
Git would insert "fake" tree nodes to split apart large directories.
This leaves the question of how useful a working directory with a
million loose files is, but pretend for a moment that I have a Git
frontend that doesn't do full check-outs to a file system but allows a
user to grab single objects from the object store straight into
memory.

Ok, this is all back-of-the-napkin:

   H = sizeof(hash) = 20 bytes.
   G = sizeof(name) =  16 bytes.

   N = number of assets = let's say one million (hypothetical)

That means the theoretical minimum size of a tree object (without
virtualization) is:

  (H + G) * N = 36 megabytes

If we virtualize, using these parameters:

   B = number of bits we chop from the head of the filename, to create
the tree prefix (i.e., internal nodes)
   D = depth of virtualization (number of internal nodes we need to
traverse before we hit a bucket)

   (...in git's loose object store, B equals 8 and D equals 1.)

We get the following formulas:

                      2^(B*D)  = number of buckets (leaf tree objects)
                N / (2^(B*D)) = number of files per bucket
   (H+G) *  N / (2^(B*D)) = size of a bucket

     2^B * H = size of an internal node

The storage overhead of a single file commit then becomes the size of
a new bucket plus the size of the internal nodes back to the root.
I.e.,

    (D * (2^B) * H)     +     (H+G) * N / (2^(B*D))

The other interesting parameter is the total number of objects
involved in just virtualization overhead. For simplicity's sake, the
number of leaf objects is a good metric.

For different values of B and D, this leads to the following overhead
on a million-file directory.

    B = 0, D = 0 (how git treats large flat directories today)
        number of buckets: 1
        single change commit overhead: 36 megabyte

    B = 8, D = 1:
        number of buckets: 256
        single change commit overhead: 145736 bytes

    B = 8, D = 1:
        number of buckets: 65536
        single change commit overhead: 10780 bytes

    B = 3, D = 4:
        number of buckets: 4096
        single change commit overhead: 9424 bytes

Ultimately, the trade-off is between how may objects are involved in
the tree-ification, how many lookups need to be done to get an object,
and the storage overhead of commits.

The more I think about it, the less certain I am this kind of
data-structure is suitable for my application. I guess Linus's quote
at the top summed it up.

I'm wondering if anybody else has put any thought into this kind of
thing, and has suggestions for alternate data structures that could
enable Git to store large flat namespaces.

Thoughts are welcome. Or am I out to lunch?

Thanks,

Jaap Suter - http://jaapsuter.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Storing Large Flat Namespaces
  2008-09-08  1:58 Storing Large Flat Namespaces Jaap Suter
@ 2008-09-08  4:26 ` Jaap Suter
  2008-09-08 13:58 ` Nicolas Pitre
  1 sibling, 0 replies; 3+ messages in thread
From: Jaap Suter @ 2008-09-08  4:26 UTC (permalink / raw)
  To: git

> I'm investigating the possibility of using Git to store a large flat
> namespace. As an example, imagine a single directory containing
> thousands or millions of files, each named using a 16-byte guid,
> evenly distributed.

I realize that an obvious suggestion may be to figure out if there is a way to 
leverage the internal structure of the files to create a more Git-friendly tree 
layout.

Unfortunately the internal structure is largely opaque to me, and highly 
dynamic.

Thanks,

Jaap

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Storing Large Flat Namespaces
  2008-09-08  1:58 Storing Large Flat Namespaces Jaap Suter
  2008-09-08  4:26 ` Jaap Suter
@ 2008-09-08 13:58 ` Nicolas Pitre
  1 sibling, 0 replies; 3+ messages in thread
From: Nicolas Pitre @ 2008-09-08 13:58 UTC (permalink / raw)
  To: Jaap Suter; +Cc: git

On Sun, 7 Sep 2008, Jaap Suter wrote:

> Hello,
> 
> I'm investigating the possibility of using Git to store a large flat
> namespace. As an example, imagine a single directory containing
> thousands or millions of files, each named using a 16-byte guid,
> evenly distributed.
> 
> I'm aware that the Git object model makes various trade-offs,
> typically in favor of managing source-tree layouts - which it does
> extremely well. However, perhaps it is possible to carry some of Git's
> features as a content revision tracker over to other storage
> applications?
> 
> Currently, tree objects for large flat directories are quite large.
> Doing a git-init, git-add, git-push on a flat directory with 10,000
> files creates a tree object that is 24 kilobytes compressed. Any
> change to a single file would create a whole new tree object, 24
> kilobytes every time.

Sure.  But as soon as you repack that repository, there will be only one 
24-kilobyte tree object for the latest revision and older revisions will 
have their tree object stored as a delta against the latest one, which 
should be merely the size of the changed entry only.

As to the idea of splitting a large flat directory namespace into 
sub-trees... that would only help in the commit case which isn't that 
interesting since commits are done much less frequent than, say, 
directory walking.  In the later case, you might end up inflating less 
object data for files at the beginning of the directory but more for 
files towards the end, which on average won't be a gain.  And in all 
cases you're looking up more objects.  And, to benefit from any 
advantage left if any, you'd need to do significant surgery in the tree 
walking code, otherwise simply reconstructing the current object format 
from subtree objects won't bring any advantage over the current packed 
format using deltas.

The pack V4 should in theory be much better with large directories since 
the design of its tree object representation would allow walking partial 
tree by recursively following deltas making a complete tree walk almost 
linear in terms of processing and data touched.  But that too requires a 
major surgery of the tree walking code (the main reason holding me back 
from rushing into any pack v4 work at the moment).  But pack v4 won't 
change anything in the commit case -- you would still have to repack in 
order to benefit from it.

Nicolas

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-09-08 13:59 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-08  1:58 Storing Large Flat Namespaces Jaap Suter
2008-09-08  4:26 ` Jaap Suter
2008-09-08 13:58 ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).