From: Jeff King <peff@peff.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Pitre <nico@cam.org>, Jakub Narebski <jnareb@gmail.com>,
Christopher Jefferson <caj@cs.st-andrews.ac.uk>,
git@vger.kernel.org
Subject: Re: Problem with large files on different OSes
Date: Thu, 28 May 2009 15:43:48 -0400
Message-ID: <20090528194348.GH13499@coredump.intra.peff.net>
In-Reply-To: <alpine.LFD.2.01.0905271457310.3435@localhost.localdomain>
On Wed, May 27, 2009 at 03:07:49PM -0700, Linus Torvalds wrote:
> I suspect you wouldn't even need to. A regular delta algorithm would just
> work fairly well to find the common parts.
>
> Sure, if the offset of the data changes a lot, then you'd miss all the
> deltas between two (large) objects that now have data that traverses
> object boundaries, but especially if the split size is pretty large (ie
> several tens of MB, possibly something like 256M), that's still going to
> be a pretty rare event.
I confess that I'm not just interested in the _size_ of the deltas, but
also speeding up deltification and rename detection. And I'm interested
in files where we can benefit from their semantics a bit. So yes, with
some overlap you would end up with pretty reasonable deltas for
arbitrary binary, as you describe.
But I was thinking something more like splitting a JPEG into a small
first chunk that contains EXIF data, and a big secondary chunk that
contains the actual image data. The second half is marked as not
compressible (since it is already lossily compressed), and not
interesting for deltification. When we consider two images for
deltification, either:

  1. they have the same "uninteresting" big part. In that case, you can
     trivially make a delta by just replacing the smaller first part (or
     even finding the optimal delta between the small parts). You never
     even need to look at the second half.

  2. they don't have the same uninteresting part. You can reject them as
     delta candidates, because there is little chance the big parts will
     be related, even for a different version of the same image.

And that extends to rename detection, as well. You can avoid looking at
the big part at all if you assume big parts with differing hashes are
going to be drastically different.
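
The two cases above can be sketched roughly as follows. This is only an illustration, not anything git implements: `split_jpeg` and `delta_decision` are hypothetical names, the splitter assumes a well-formed JPEG whose header segments each carry a two-byte big-endian length, and it treats everything from the start-of-scan marker (0xFFDA) onward as the "uninteresting" big part.

```python
import hashlib


def split_jpeg(data: bytes) -> tuple[bytes, bytes]:
    """Hypothetical splitter: return (metadata head, image-data tail).

    Walks the header segments after the SOI marker and cuts at the
    start-of-scan marker. Anything that does not look like a JPEG is
    returned as one big tail with an empty head.
    """
    if data[:2] != b"\xff\xd8":  # no SOI marker: not a JPEG
        return b"", data
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            return data[:pos], data[pos:]
        # Segment length is big-endian and counts its own two bytes.
        seg_len = int.from_bytes(data[pos + 2:pos + 4], "big")
        pos += 2 + seg_len
    return b"", data


def delta_decision(a: bytes, b: bytes) -> str:
    """Decide how to treat two blobs as delta (or rename) candidates."""
    _, tail_a = split_jpeg(a)
    _, tail_b = split_jpeg(b)
    # Compare only the hashes of the big parts; with a cache of splits
    # the big parts would not need to be read again at all.
    if hashlib.sha1(tail_a).digest() == hashlib.sha1(tail_b).digest():
        return "delta-heads-only"  # case 1: only the small heads differ
    return "reject"  # case 2: assume the big parts are unrelated
```

In this sketch the expensive comparison is reduced to one hash check per pair, which is the property that would also help rename detection.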
> > That would make things like rename detection very fast. Of course it has
> > the downside that you are cementing whatever split you made into history
> > for all time. And it means that two people adding the same content might
> > end up with different trees. Both things that git tries to avoid.
>
> It's the "I can no longer see that the files are the same by comparing
> SHA1's" that I personally dislike.
Right. I don't think splitting in the git data structure itself is worth
it for that reason. But deltification and rename detection keeping a
cache of smart splits that says "You can represent <sha-1> as this
concatenation of <sha-1>s" means they can still get some advantage (over
multiple runs, certainly, but possibly even over a single run: a smart
splitter might not even have to look at the entire file contents).
> So my "fixed chunk" approach would be nice in that if you have this kind
> of "chunkblob" entry, in the tree (and index) it would literally be one
> entry, and look like that:
>
> 100644 chunkblob <sha1>
But if I am understanding you correctly, you _are_ proposing to munge
the git data structure here. Which means that pre-chunkblob trees will
point to the raw blob, and then post-chunkblob trees will point to the
chunked representation. And that means not being able to use the sha-1
to see that they eventually point to the same content.
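
To make the concern concrete, here is a small sketch. The "chunkblob" object type and its manifest layout (one chunk id per line) are invented for illustration only; nothing in the thread specifies them. Only the "blob <size>\0" header hashing matches what git does for ordinary blobs.

```python
import hashlib


def object_id(kind: str, payload: bytes) -> str:
    """SHA-1 over a '<kind> <size>\\0' header plus the payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def chunkblob_id(data: bytes, chunk_size: int) -> str:
    """Id of a hypothetical chunked representation of `data`.

    The manifest format here is an assumption made for this example.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = "".join(object_id("blob", c) + "\n" for c in chunks).encode()
    return object_id("chunkblob", manifest)


content = b"x" * 1000
raw = object_id("blob", content)          # what a pre-chunkblob tree records
chunked = chunkblob_id(content, 256)      # what a post-chunkblob tree records
print(raw == chunked)                     # same content, different ids
```

The same bytes end up under two different ids, and a different chunk size would give yet another one, so comparing tree entries no longer tells you the files are identical.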
-Peff