From: Neal Kreitzinger <nkreitzinger@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Bo Chen <chen@chenirvine.org>,
Sergio <sergio.callegari@gmail.com>,
git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Sat, 31 Mar 2012 15:28:06 -0500
Message-ID: <4F7768D6.3010400@gmail.com>
In-Reply-To: <20120330203430.GB20376@sigill.intra.peff.net>

On 3/30/2012 3:34 PM, Jeff King wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of "delta for large file" problem.
>>
>> 1 large file
>>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.
>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?
>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
>
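As a rough illustration of the chunk-splitting idea quoted above, here is
a minimal sketch of content-defined chunking. It is not bup's actual
bupsplit code: the polynomial rolling hash, the 48-byte window, the 10-bit
mask, and the names split_chunks and chunk_ids are all assumptions chosen
for illustration. The point is only that chunk boundaries depend on nearby
bytes, so a small edit should disturb a few chunks rather than all of them.

```python
import hashlib

BASE = 257             # assumed multiplier for the polynomial hash
MOD = (1 << 31) - 1    # assumed prime modulus
WINDOW = 48            # assumed rolling-window size in bytes
MASK = (1 << 10) - 1   # boundary when the low 10 bits are all ones


def split_chunks(data: bytes, window: int = WINDOW, mask: int = MASK):
    """Split data at positions chosen by a rolling hash of the last
    `window` bytes, so boundaries depend only on local content."""
    out_factor = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the oldest byte from the window before adding the new one.
            h = (h - data[i - window] * out_factor) % MOD
        h = (h * BASE + byte) % MOD
        # Only test for a boundary once the window is full.
        if i + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk with no boundary
    return chunks


def chunk_ids(data: bytes):
    """Return the set of SHA-1 hex digests of the chunks of data."""
    return {hashlib.sha1(c).hexdigest() for c in split_chunks(data)}
```

Comparing chunk_ids() for two versions of a file gives a feel for how many
chunks they could share; chunks with identical digests would need to be
stored only once.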
(I'm no expert on "big-files" in git or elsewhere, but this thread is
immensely interesting to me as a git user who wants to track all sorts
of binary files and possibly large text files in the very near future,
i.e., all components tied to a server build and upgrades beyond the
linux-distro/rpms, and perhaps including them also.)

Let's take an even bigger step back for a moment. Who determines if a
file shall be a big-file or not? Git or the user? How is it determined
if a file shall be a "big-file" or not?

Who decides bigness:
Bigness seems to be relative to system resources. Does the user crunch
the numbers to determine if a file is a big-file, or does git? If the
numbers are relative, then should git query the system and make the
determination? Either way, once the system resources are upgraded and
formerly "big-files" are no longer considered "big", how is the previous
history refactored to behave "non-big-file-like"? Conversely, if the
system resources are redistributed so that formerly non-big files are
now relatively big (e.g., moved from a powerful central server login to
laptops), how is the history refactored to accommodate the
newly-relative-bigness?

How bigness is decided:
There seem to be two basic types of big-files: big-worktree-files and
big-history-files. A big-worktree-file that is delta-friendly is not a
big-history-file. A non-big-worktree-file that is delta-unfriendly is a
big-history-file problem. If you are working alone on an old computer,
you are probably more concerned about big-worktree-files (memory). If
you are working in a large group making lots of changes to the same
files on a powerful server, then you are probably more concerned about
big-history-file size (disk space). Of course, everyone is concerned
about big-worktree-files that are delta-unfriendly.

At what point is a delta-friendly file considered a "big-file"? I
assume that may depend on the degree of delta-friendliness. I imagine
that a text file and a vm-image differ in delta-friendliness by several
degrees.

At what point(s) is a delta-unfriendly file considered a "big-file"? I
assume that may depend on the degree(s) of delta-unfriendliness. I
imagine a compiled program and a compressed container differ in
delta-unfriendliness by several degrees.
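
One rough way to put a number on "degree of delta-friendliness" is to
compare how well a new version compresses on its own against how well it
compresses when the old version is available as a reference. The sketch
below uses zlib's preset-dictionary feature for that. It is only a proxy
and not how git computes deltas; the function names and the scoring
formula are assumptions, and because deflate looks back at most 32 KiB,
it is only meaningful for small samples of a file.

```python
import zlib


def compressed_size(data: bytes, reference: bytes = b"") -> int:
    """Size of data after deflate, optionally primed with reference
    as a preset dictionary."""
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def delta_friendliness(old: bytes, new: bytes) -> float:
    """Fraction of new's standalone compressed size that is saved by
    knowing old. Near 1.0 means new is mostly old; near 0.0 (or
    slightly below) means old does not help."""
    alone = compressed_size(new)
    with_old = compressed_size(new, reference=old)
    return 1.0 - with_old / alone
```

A lightly edited copy of an incompressible sample should score close to
1.0 under this measure, while two unrelated incompressible samples should
score close to 0.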

My understanding is that git does not ever delta-compress binary files.
That would mean even a small-worktree-binary-file becomes a
big-history-file over time.

v/r,
neal