git.vger.kernel.org archive mirror
From: Chris Mason <mason@suse.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] multi item packed files
Date: Thu, 21 Apr 2005 12:23:02 -0400
Message-ID: <200504211223.03479.mason@suse.com>
In-Reply-To: <Pine.LNX.4.58.0504210832490.2344@ppc970.osdl.org>

On Thursday 21 April 2005 11:41, Linus Torvalds wrote:
> On Thu, 21 Apr 2005, Chris Mason wrote:
> > There have been a few threads on making git more space efficient, and
> > eventually someone mentions tiny files and space fragmentation.  Now that
> > git object names are decoupled from their compression, it's easier to
> > consider a variety of compression algorithms.  I whipped up a really
> > silly "pack files together" compression.
>
> Careful.
>
> This is something that needs history to tell whether it's effective. In
> particular, if one file changes and another one does not, your packed
> archive now ends up being a new blob, so while you "saved space" by having
> just one blob for the object, in reality you didn't save any space at all
> because with the <x> files changing, you just guaranteed that the packed
> blob changes <x> times more often.

The packed blob lives in git but never makes it into a tree.  Let's say that I 
have a packed blob with files "a, b, c", and another packed blob with files 
"x, y, z".  Someone changes files b and z, and then runs update-cache b z.

Now we have 2 unchanged packed blobs: "a, b, c", "x, y, z",  and one new 
packed blob: "b_new, z_new".  This means that in order for the packing to 
help, we have to change more than one file at a time.  That's why it would be 
good to have update-cache include the write-tree and commit-tree.
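
To make the a/b/c and x/y/z example concrete, here is a rough sketch in 
Python.  It is not the patch: PackedStore, write_pack and read are 
hypothetical names, and a dict stands in for the on-disk packed blob.  It only 
shows that an update writes one new packed blob holding the changed files and 
leaves the older packed blobs untouched.

```python
import hashlib
from typing import Dict, List, Tuple


def object_name(data: bytes) -> str:
    """Name an object by the SHA-1 of its contents (simplified, no header)."""
    return hashlib.sha1(data).hexdigest()


class PackedStore:
    """Toy model: each packed blob is a dict of file name -> contents."""

    def __init__(self) -> None:
        self.packs: List[Dict[str, bytes]] = []
        # object name -> (index of the packed blob, file name inside it)
        self.index: Dict[str, Tuple[int, str]] = {}

    def write_pack(self, files: Dict[str, bytes]) -> int:
        """Store all files from one update together as a single packed blob."""
        pack_no = len(self.packs)
        self.packs.append(dict(files))
        for name, data in files.items():
            self.index[object_name(data)] = (pack_no, name)
        return pack_no

    def read(self, obj: str) -> bytes:
        """Look an object up by name, wherever it was packed."""
        pack_no, name = self.index[obj]
        return self.packs[pack_no][name]


store = PackedStore()
store.write_pack({"a": b"a v1", "b": b"b v1", "c": b"c v1"})
store.write_pack({"x": b"x v1", "y": b"y v1", "z": b"z v1"})

# Someone changes b and z and updates both at once: one new packed blob.
new_pack = store.write_pack({"b": b"b v2", "z": b"z v2"})

print(len(store.packs), "packed blobs;", "new one holds", sorted(store.packs[new_pack]))
```

The old versions of b and z stay readable from the first two packed blobs, 
and the new packed blob only pays off because two files went in together.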

>
> See? Your "packing in space" ends up also resulting in "packing in time",
> and you didn't actually win anything.
>
> (If you did a good job of packing, you hopefully didn't _lose_ anything
> either - you needed 1:<x> number of objects that took 1:<x> the space if
> the packing ended up perfect - but since you needed <x> times more of
> these objects unless they all change together, you end up with exactly the
> same space usage).
>
> So the argument is: you can't lose with the method, and you _can_ win.
> Right?
>
> Wrong. You most definitely _can_ lose: you end up having to optimize for
> one particular filesystem blocking size, and you'll lose on any other
> filesystem. And you'll lose on the special filesystem of "network
> traffic", which is byte-granular.
>
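
The block-size point above can be put into numbers with a small Python 
sketch.  The object sizes are made up for illustration, and the model is 
deliberately crude: every file is rounded up to whole blocks, and inode and 
directory overhead are ignored.  A block size of 1 stands in for the 
byte-granular "network traffic" case.

```python
def on_disk_size(file_sizes, block_size):
    """Bytes allocated when each file is rounded up to whole blocks."""
    # -(-n // b) is integer ceiling division.
    return sum(-(-size // block_size) * block_size for size in file_sizes)


def packed_size(file_sizes, block_size):
    """Bytes allocated when all the files are stored as one packed file."""
    return on_disk_size([sum(file_sizes)], block_size)


# Hypothetical compressed object sizes in bytes (not measured from git).
sizes = [300, 700, 1200, 150]

for block_size in (1, 1024, 4096):
    loose = on_disk_size(sizes, block_size)
    packed = packed_size(sizes, block_size)
    print(f"block={block_size:5d}  loose={loose:6d}  packed={packed:6d}")
```

Under these assumptions the saving grows with the block size and vanishes 
completely at byte granularity, which is the case where packing cannot win 
anything on space alone.
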
The patch does have one extra directory entry (for the packed blob), but from 
a network point of view roughly the same number of bytes should be copied.  
The hardlinks won't play nice with rsync though, soft links might be better.

Packing isn't just about filesystem block sizes; it's about locality.  All the 
hashing means pretty much every access in git is random.  With packing we can 
at least try to put a single changeset together on disk.  Right now it 
doesn't matter much, but when the git tree is 6GB in two years we'll feel the 
pain.

> I don't want to pee on people's parades, and I'm all for gathering numbers,
> but the thing is, the current git isn't actually all that bad, and I
> guarantee that it's hard to make it better without using delta
> representation. And the current thing is really really simple.
>

Grin, if I thought you wanted the patch I might have tried to pretty it up a 
little.  The point is that all the discussions about ways to make git use 
less space end up stuck in "but wait, that'll make a bunch of tiny files and 
filesystems aren't good at that".  So I believe some kind of packing is a 
required building block for any kind of delta storage.

-chris

Thread overview: 17+ messages
2005-04-21 15:13 [PATCH] multi item packed files Chris Mason
2005-04-21 15:41 ` Linus Torvalds
2005-04-21 16:23   ` Chris Mason [this message]
2005-04-21 19:28   ` Krzysztof Halasa
2005-04-21 20:07     ` Linus Torvalds
2005-04-22  9:40       ` Krzysztof Halasa
2005-04-22 18:12         ` Martin Uecker
2005-04-21 20:22     ` Chris Mason
2005-04-21 22:47       ` Linus Torvalds
2005-04-22  0:16         ` Chris Mason
2005-04-22 16:22           ` Linus Torvalds
2005-04-22 18:58             ` Chris Mason
2005-04-22 19:43               ` Linus Torvalds
2005-04-22 20:32                 ` Chris Mason
2005-04-22 23:55                   ` Chris Mason
2005-04-25 22:20                     ` Chris Mason
2005-04-22  9:48       ` Krzysztof Halasa
