git.vger.kernel.org archive mirror
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Alan Manuel Gloria <almkglor@gmail.com>
Cc: Jeff King <peff@peff.net>, Nicolas Pitre <nico@cam.org>,
	Jakub Narebski <jnareb@gmail.com>,
	Christopher Jefferson <caj@cs.st-andrews.ac.uk>,
	git@vger.kernel.org
Subject: Re: Problem with large files on different OSes
Date: Wed, 27 May 2009 18:56:49 -0700 (PDT)
Message-ID: <alpine.LFD.2.01.0905271825520.3435@localhost.localdomain>
In-Reply-To: <f95910c20905271609u63d04965oa38b8af34d7704c1@mail.gmail.com>



On Thu, 28 May 2009, Alan Manuel Gloria wrote:
> 
> If you'd prefer someone else to hack it, can you at least give me some
> pointers on which code files to start looking?  I'd really like to
> have proper large-file-packing support, where large file is anything
> much bigger than a megabyte or so.
> 
> Admittedly I'm not a filesystems guy and I can just barely grok git's
> blobs (they're the actual files, right? except they're named with
> their hash), but not packs (err, a bunch of files?) and trees (brown
> and green stuff you plant?).  Still, I can try to learn it.

The packs are a big part of the complexity.

If you were to keep the big files as unpacked blobs, that would be 
fairly simple - but the pack-file format is needed for fetching and 
pushing things, so it's not really an option.

For your particular case, the simplest approach is probably to just 
limit the delta search. Something like just saying "if the object is 
larger than X, don't even bother to try to delta it, and just pack it 
without delta compression". 

The code would still load that whole object in one go, but it sounds like 
you can handle _one_ object at a time. So for your case, I don't think you 
need a fundamental git change - you'd be ok with just an inefficient pack 
format for large files that are very expensive to pack otherwise.
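
To make the "larger than X" idea concrete, here is a minimal stand-alone 
sketch of the decision. It is not git code: the function name 
should_try_delta() and its parameters are made up for illustration, and 
a limit of 0 is assumed to mean "no limit".

```c
/*
 * Hypothetical helper, not part of git: decide whether an object of
 * the given size is worth a delta attempt.
 *
 * Returns 1 to try delta compression, 0 to pack the object as-is.
 * A limit of 0 is treated as "no limit" (an assumption of this sketch).
 */
static int should_try_delta(unsigned long size, unsigned long limit)
{
	if (limit && size >= limit)
		return 0;	/* too big: skip the expensive delta search */
	return 1;
}
```

With a limit of 64*1024*1024, a 4kB source file would still be considered 
for delta compression while a 700MB disk image would not. The object is 
still read and compressed in one piece; only the delta search is skipped.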

You can already do that by using .gitattributes to not delta entries 
by name, but maybe it's worth doing explicitly by size too.

I realize that the "delta" attribute is apparently almost totally 
undocumented. But if your big blobs have a particular name pattern, what 
you should try is to do something like

 - in your '.gitattributes' file (or .git/info/attributes if you don't 
   want to check it in), add a line like

	*.img -delta

   which now sets the 'delta' attribute to false for all objects that 
   match the '*.img' pattern.

 - see if pack creation is now acceptable (i.e. do a "git gc" or try to 
   push somewhere)
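
As a rough illustration of what such a name-based rule amounts to, the 
sketch below matches a path against a list of glob patterns and reports 
whether delta compression should be skipped. It is a simplification, not 
git's attribute code: struct delta_rule and path_disables_delta() are 
hypothetical names, matching uses POSIX fnmatch() on the last path 
component only, and the last matching rule is assumed to win.

```c
#include <fnmatch.h>
#include <string.h>

/* Hypothetical rule: a glob pattern plus whether it sets delta to false. */
struct delta_rule {
	const char *pattern;
	int delta_false;
};

/*
 * Return 1 if the rules mark this path as "never delta", else 0.
 * Patterns are matched against the last path component only, which
 * is a simplification of real attribute matching.
 */
static int path_disables_delta(const char *path,
			       const struct delta_rule *rules, size_t nr)
{
	const char *base = strrchr(path, '/');
	int result = 0;
	size_t i;

	base = base ? base + 1 : path;
	for (i = 0; i < nr; i++)
		if (!fnmatch(rules[i].pattern, base, 0))
			result = rules[i].delta_false; /* later rules override earlier ones */
	return result;
}
```

With a single rule for "*.img", a path like images/disk.img would be 
packed without delta compression, while src/main.c would be handled as 
usual.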

Something like the following may also work, as a more generic "just don't 
even bother trying to delta huge files".

Totally untested. Maybe it works. Maybe it doesn't.

		Linus

---
 Documentation/config.txt |    7 +++++++
 builtin-pack-objects.c   |    9 +++++++++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 2c03162..8c21027 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1238,6 +1238,13 @@ older version of git. If the `{asterisk}.pack` file is smaller than 2 GB, howeve
 you can use linkgit:git-index-pack[1] on the *.pack file to regenerate
 the `{asterisk}.idx` file.
 
+pack.packDeltaLimit::
+	The maximum size of objects that we try to delta (default 64MB, 0 = no limit).
++
+Big files can be very expensive to delta, and if they are large binary
+blobs, there is likely little upside to it anyway. So just pack them
+as-is, and don't waste time on them.
+
 pack.packSizeLimit::
 	The default maximum size of a pack.  This setting only affects
 	packing to a file, i.e. the git:// protocol is unaffected.  It
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 9742b45..9a0072b 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -85,6 +85,7 @@ static struct progress *progress_state;
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 static int pack_compression_seen;
 
+static unsigned long pack_delta_limit = 64*1024*1024;
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = 0;
 static unsigned long cache_max_small_delta_size = 1000;
@@ -1270,6 +1271,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 	if (trg_entry->type != src_entry->type)
 		return -1;
 
+	/* If we limit delta generation, don't even bother for larger blobs */
+	if (pack_delta_limit && trg_entry->size >= pack_delta_limit)
+		return -1;
+
 	/*
 	 * We do not bother to try a delta that we discarded
 	 * on an earlier try, but only when reusing delta data.
@@ -1865,6 +1870,10 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		pack_size_limit_cfg = git_config_ulong(k, v);
 		return 0;
 	}
+	if (!strcmp(k, "pack.packdeltalimit")) {
+		pack_delta_limit = git_config_ulong(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
