All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin Uecker <muecker@gmx.de>
To: git@vger.kernel.org
Subject: Re: space compression (again)
Date: Sat, 16 Apr 2005 19:37:02 +0200	[thread overview]
Message-ID: <20050416173702.GA12605@macavity> (raw)
In-Reply-To: <Pine.LNX.4.61.0504161101470.29343@cag.csail.mit.edu>

[-- Attachment #1: Type: text/plain, Size: 1895 bytes --]

On Sat, Apr 16, 2005 at 11:11:00AM -0400, C. Scott Ananian wrote:
> On Sat, 16 Apr 2005, Martin Uecker wrote:
> 
> >The right thing (TM) is to switch from SHA1 of compressed
> >content for the complete monolithic file to a merkle hash tree
> >of the uncompressed content. This would make the hash
> >independent of the actual storage method (chunked or not).
> 
> It would certainly be nice to change to a hash of the uncompressed 
> content, rather than a hash of the compressed content, but it's not 
> strictly necessary, since files are fetched all at once: there's not 'read 
> subrange' operation on blobs.
> 
> I assume 'merkle hash tree' is talking about:
>   http://www.open-content.net/specs/draft-jchapweske-thex-02.html
> ..which is very interesting, but not quite what I was thinking.
> The merkle hash approach seems to require fixed chunk boundaries.

I don't know what is written there, but I don't
consider fixed chunk boundaries part of the definition.

> The rsync approach does not use fixed chunk boundaries; this is necessary 
> to ensure good storage reuse for the expected case (ie; inserting a single 
> line at the start or in the middle of the file, which changes all the 
> chunk boundaries).

Yes. The chunk boundaries should be determined deterministically
from local properties of the data. Use a rolling checksum over
some small window and split the file it it hits a special value (0).
This is what the rsyncable patch to zlib does.

> Further, in the absence of subrange reads on blobs, it's not entirely 
> clear what using a merkle hash would buy you.

The whole design of git is a hash tree. If you extend
this tree structure into files you end up with merkle
hash trees. Everything else is just more complicated.

Martin
 

-- 
One night, when little Giana from Milano was fast asleep,
she had a strange dream.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

  reply	other threads:[~2005-04-16 17:35 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-15 17:19 space compression (again) C. Scott Ananian
2005-04-15 18:34 ` Linus Torvalds
2005-04-15 18:45   ` C. Scott Ananian
2005-04-15 19:00     ` Derek Fawcus
2005-04-15 19:11     ` Linus Torvalds
2005-04-16 14:39       ` Martin Uecker
2005-04-16 15:11         ` C. Scott Ananian
2005-04-16 17:37           ` Martin Uecker [this message]
2005-04-19 12:39             ` Martin Uecker
2005-04-15 18:50 ` Derek Fawcus
  -- strict thread matches above, loose matches on Subject: below --
2005-04-15 19:33 Ray Heasman
2005-04-16 12:29 ` David Lang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050416173702.GA12605@macavity \
    --to=muecker@gmx.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.