From: Jeff King <peff@peff.net>
To: Nicolas Pitre <nico@cam.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Jakub Narebski <jnareb@gmail.com>,
Christopher Jefferson <caj@cs.st-andrews.ac.uk>,
git@vger.kernel.org
Subject: Re: Problem with large files on different OSes
Date: Wed, 27 May 2009 17:53:14 -0400
Message-ID: <20090527215314.GA10362@coredump.intra.peff.net>
In-Reply-To: <alpine.LFD.2.00.0905271312220.3906@xanadu.home>
On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote:
> My idea for handling big files is simply to:
>
> 1) Define a new parameter to determine what is considered a big file.
>
> 2) Store any file larger than the threshold defined in (1) directly into
> a pack of their own at "git add" time.
>
> 3) Never attempt to diff nor delta large objects, again according to
> (1) above. It is typical for large files not to be deltifiable, and
> a diff for files in the thousands of megabytes cannot possibly be
> sane.
What about large files that have a short metadata section that may
change? Versions with only the metadata changed delta well, and with a
custom diff driver, can produce useful diffs. And I don't think that is
an impractical or unlikely example; large files can often be tagged
media.
Linus' "split into multiple objects" approach means you could perhaps
split intelligently into metadata and "uninteresting data" sections
based on the file type. That would make things like rename detection
very fast. Of course it has the downside that you are cementing whatever
split you made into history for all time. And it means that two people
adding the same content might end up with different trees. Both are
things that git tries to avoid.
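To make the "different trees" concern concrete, here is a rough sketch in Python rather than git's own C. The only git fact it relies on is that a blob id is the SHA-1 of a "blob <size>\0" header followed by the content. The `split_ids` helper, the sample content, and the two cut points are invented for illustration and do not correspond to anything in git.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Object id git would assign to `data` stored as a blob."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def split_ids(data: bytes, cut: int) -> list[str]:
    """Hypothetical add-time split: hash the two pieces as separate blobs."""
    return [blob_id(data[:cut]), blob_id(data[cut:])]


# Same file content, added by two people whose tools pick different cut points.
content = b"title=demo\n" + bytes(1024)
alice = split_ids(content, len(b"title=demo\n"))
bob = split_ids(content, 512)

print(alice == bob)                            # the piece ids disagree
print(blob_id(content) == blob_id(content))    # the whole-file id does not
```

Any tree that lists the pieces would therefore differ between the two, even though the bytes they added are identical.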
I wonder if it would be useful to make such a split at _read_ time. That
is, still refer to the sha-1 of the whole content in the tree objects,
but have a separate cache that says "hash X splits to the concatenation
of Y,Z". Thus you can always refer to the "pure" object, both as a user,
and in the code. So we could avoid retrofitting all of the code -- just
some parts like diff might want to handle an object in multiple
segments.
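A toy version of that cache, again in Python and purely as a sketch: `SplitCache`, `record_split`, `segment_ids` and `changed_segments` are made-up names, the object store is a plain dict, and the cut points are supplied by the caller rather than derived from the file type. Trees would keep referring to the whole-object id; the split table is side data that can be thrown away and rebuilt.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Object id git would assign to `data` stored as a blob."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


class SplitCache:
    """In-memory stand-in for an object store plus a read-time split table."""

    def __init__(self):
        self.objects = {}   # object id -> content
        self.splits = {}    # whole-object id -> ordered list of segment ids

    def add(self, data: bytes) -> str:
        oid = blob_id(data)
        self.objects[oid] = data
        return oid

    def record_split(self, whole_id: str, cuts: list[int]) -> None:
        """Record "hash X splits to the concatenation of Y,Z" for one object."""
        data = self.objects[whole_id]
        bounds = [0, *cuts, len(data)]
        parts = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        assert b"".join(parts) == data  # the split must reproduce X exactly
        self.splits[whole_id] = [self.add(p) for p in parts]

    def segment_ids(self, whole_id: str) -> list[str]:
        """Segments if a split is cached, else the object as one segment."""
        return self.splits.get(whole_id, [whole_id])


def changed_segments(cache: SplitCache, old_id: str, new_id: str) -> list[int]:
    """Indexes of segments whose ids differ (assumes equal segment counts)."""
    pairs = zip(cache.segment_ids(old_id), cache.segment_ids(new_id))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


payload = bytes(4096)                 # stands in for the large media data
old_meta, new_meta = b"title=old\n", b"title=new\n"

cache = SplitCache()
old_id = cache.add(old_meta + payload)
new_id = cache.add(new_meta + payload)
cache.record_split(old_id, [len(old_meta)])
cache.record_split(new_id, [len(new_meta)])

print(changed_segments(cache, old_id, new_id))  # prints [0]
```

Only the metadata segment shows up as changed, so a diff that knows about segments could skip the payload without reading it, while everything else keeps using the whole-object ids.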
-Peff