git.vger.kernel.org archive mirror
* Leaving large binaries out of the packfile
From: Joshua Jensen @ 2010-06-10  6:25 UTC (permalink / raw)
  To: git@vger.kernel.org

  Hi.

I've been dealing with a Subversion repository that contains a lot of 
large binaries.  Git generally handles them reasonably well, but it 
chokes under the pressure of a 'git gc' on this git-svn repository.  
The repository packs total 2.7 gigabytes, and as it turns out, the 250 
individual blob revisions of large binaries account for about 2.4 
gigabytes of that.

Sometimes, 'git gc' runs out of memory.  When it does, I have to track 
down which file is causing the problem so I can mark it with '-delta' 
in .gitattributes.  Mostly, though, the repacking takes forever, and I 
dread running the operation.
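For reference, marking paths with '-delta' looks like this in 
.gitattributes (the pattern here is just an example):

```
# Skip delta compression attempts for these large binaries.
*.bin -delta
```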

As an experiment, I added support for a 'pack' attribute in 
.gitattributes.  Files matching a '-pack' entry are left loose in the 
repository.  During a 'git gc', the loose objects are left in place 
instead of gigabytes of data being recopied each time, and 'git gc' 
runs very quickly with this change.
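With the patch below applied, such an entry would look something like 
this; note that the 'pack' attribute is introduced by this patch and 
does not exist in stock Git:

```
# Leave matching blobs loose instead of packing them (patched Git only).
*.dat -pack
```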

The only issue I've found is in too_many_loose_objects().  gitk is 
always telling me the repository needs to be packed, obviously because 
of all the loose objects.

I haven't yet come up with a good idea for handling this.  I thought 
about putting the forced loose objects in a separate directory.  (This 
idea goes along with another that I want to build on top of this 
functionality, the ability to commit and have -pack binaries go to an 
alternates location.)  I have also thought about writing out a file 
with a count of the forced loose objects and using it to drive down 
the estimate made by too_many_loose_objects().

Does anyone have any thoughts?

Thanks!

Josh

---
  builtin/pack-objects.c |   25 +++++++++++++++++++++++++
  1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 214d7ef..f33a7fb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -644,6 +644,28 @@ static int no_try_delta(const char *path)
      return 0;
  }

+static void setup_pack_attr_check(struct git_attr_check *check)
+{
+    static struct git_attr *attr_pack;
+
+    if (!attr_pack)
+        attr_pack = git_attr("pack");
+
+    check[0].attr = attr_pack;
+}
+
+static int must_pack(const char *path)
+{
+    struct git_attr_check check[1];
+
+    setup_pack_attr_check(check);
+    if (git_checkattr(path, ARRAY_SIZE(check), check))
+        return 1;
+    if (ATTR_FALSE(check->value))
+        return 0;
+    return 1;
+}
+
  static int add_object_entry(const unsigned char *sha1, enum object_type type,
                  const char *name, int exclude)
  {
@@ -667,6 +689,9 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
      if (!exclude && local && has_loose_object_nonlocal(sha1))
          return 0;

+    if (name && !must_pack(name))
+        return 0;
+
      for (p = packed_git; p; p = p->next) {
          off_t offset = find_pack_entry_one(sha1, p);
          if (offset) {
--
1.7.1.msysgit.3.1.g108b5.dirty


* Re: Leaving large binaries out of the packfile
From: Shawn O. Pearce @ 2010-06-10 18:04 UTC (permalink / raw)
  To: Joshua Jensen; +Cc: git@vger.kernel.org

Joshua Jensen <jjensen@workspacewhiz.com> wrote:
> Sometimes, 'git gc' runs out of memory.  I have to discover which file  
> is causing the problem, so I can add it to .gitattributes with a  
> '-delta' flag.  Mostly, though, the repacking takes forever, and I dread  
> running the operation.

If you have the list of big objects, you can put them into their
own pack file manually.  Feed their SHA-1 names on stdin to git
pack-objects, and save the resulting pack under .git/objects/pack.

Assuming the pack was called pack-DEADC0FFEE.pack, create a file
called pack-DEADC0FFEE.keep in the same directory.  This will stop
Git from trying to repack the contents of that pack file.

Now run `git gc` to remove those huge objects from the pack file
that contains all of the other stuff.
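For the archives, here is the whole procedure as a self-contained 
sketch.  The repository and the `big.bin` file are fabricated purely 
for illustration; in a real repository you would only run the 
pack-objects, .keep, and gc steps:

```shell
set -e

# Throwaway demo repository (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

# One "large" binary plus some ordinary content.
head -c 65536 /dev/urandom > big.bin
echo hello > readme.txt
git add big.bin readme.txt
git commit -qm 'add files'

# Feed the big objects, in rev-list --objects format, to pack-objects;
# the pack is written as .git/objects/pack/pack-<SHA>.pack.
git rev-list --objects HEAD | grep ' big.bin' |
    git pack-objects -q .git/objects/pack/pack > /dev/null

# A matching .keep file stops repack/gc from touching that pack.
for p in .git/objects/pack/pack-*.pack; do
    touch "${p%.pack}.keep"
done

# gc now repacks everything else and leaves the kept pack alone.
git gc -q
ls .git/objects/pack/
```

The mere existence of the .keep file is what matters; its contents 
are ignored.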

-- 
Shawn.


* Re: Leaving large binaries out of the packfile
From: Paolo Bonzini @ 2010-06-11 15:29 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Joshua Jensen, git@vger.kernel.org

On 06/10/2010 08:04 PM, Shawn O. Pearce wrote:
> Joshua Jensen<jjensen@workspacewhiz.com>  wrote:
>> Sometimes, 'git gc' runs out of memory.  I have to discover which file
>> is causing the problem, so I can add it to .gitattributes with a
>> '-delta' flag.  Mostly, though, the repacking takes forever, and I dread
>> running the operation.
>
> If you have the list of big objects, you can put them into their
> own pack file manually.  Feed their SHA-1 names on stdin to git
> pack-objects, and save the resulting pack under .git/objects/pack.

Do you know any simpler way than

git log --pretty=format:%H | while read x; do
   git ls-tree $x -- ChangeLog | awk '{print $3}'
done | sort -u

to do this?  I thought it would be nice to add --sha1-only to 
git-ls-tree, but maybe I'm missing some other trick.

> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
> called pack-DEADC0FFEE.keep in the same directory.  This will stop
> Git from trying to repack the contents of that pack file.
>
> Now run `git gc` to remove those huge objects from the pack file
> that contains all of the other stuff.

That obviously wouldn't help if these large binaries are updated often, 
however.

Paolo


* Re: Leaving large binaries out of the packfile
From: Shawn O. Pearce @ 2010-06-11 16:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Joshua Jensen, git@vger.kernel.org

Paolo Bonzini <bonzini@gnu.org> wrote:
> On 06/10/2010 08:04 PM, Shawn O. Pearce wrote:
>> Joshua Jensen<jjensen@workspacewhiz.com>  wrote:
>>> Sometimes, 'git gc' runs out of memory.  I have to discover which file
>>> is causing the problem, so I can add it to .gitattributes with a
>>> '-delta' flag.  Mostly, though, the repacking takes forever, and I dread
>>> running the operation.
>>
>> If you have the list of big objects, you can put them into their
>> own pack file manually.  Feed their SHA-1 names on stdin to git
>> pack-objects, and save the resulting pack under .git/objects/pack.
>
> Do you know any simpler way than
>
> git log --pretty=format:%H | while read x; do
>   git ls-tree $x -- ChangeLog | awk '{print $3}'
> done | sort -u
>
> to do this?  I thought it would be nice to add --sha1-only to  
> git-ls-tree, but maybe I'm missing some other trick.

Maybe

  git rev-list --objects HEAD | grep ' ChangeLog'

pack-objects wants the output of rev-list --objects as input, file
name and all.  So it's just a matter of selecting the right lines
from its output.
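A quick throwaway-repository demonstration of the format (the 
ChangeLog contents are fabricated for the demo): each line that grep 
keeps is "<sha1> <path>", exactly what pack-objects accepts on stdin.

```shell
set -e

# Throwaway demo repository (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

# Two commits touching ChangeLog give two blob revisions of it.
echo one > ChangeLog
git add ChangeLog
git commit -qm 'first'
echo two >> ChangeLog
git commit -aqm 'second'

# Prints two "<sha1> ChangeLog" lines, one per blob revision.
git rev-list --objects HEAD | grep ' ChangeLog'
```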

>> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
>> called pack-DEADC0FFEE.keep in the same directory.  This will stop
>> Git from trying to repack the contents of that pack file.
>>
>> Now run `git gc` to remove those huge objects from the pack file
>> that contains all of the other stuff.
>
> That obviously wouldn't help if these large binaries are updated often,  
> however.

No, it doesn't.  But you still could do this on a periodic basis.
That way you only drag around a handful of recently created large
binaries during a typical `git gc`, and not the entire project's
history of them.

-- 
Shawn.


* Re: Leaving large binaries out of the packfile
From: Joshua Jensen @ 2010-06-24  6:32 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git@vger.kernel.org

  ----- Original Message -----
From: Shawn O. Pearce
Date: 6/10/2010 12:04 PM
> Joshua Jensen<jjensen@workspacewhiz.com>  wrote:
>> Sometimes, 'git gc' runs out of memory.  I have to discover which file
>> is causing the problem, so I can add it to .gitattributes with a
>> '-delta' flag.  Mostly, though, the repacking takes forever, and I dread
>> running the operation.
> If you have the list of big objects, you can put them into their
> own pack file manually.  Feed their SHA-1 names on stdin to git
> pack-objects, and save the resulting pack under .git/objects/pack.
>
> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
> called pack-DEADC0FFEE.keep in the same directory.  This will stop
> Git from trying to repack the contents of that pack file.
>
> Now run `git gc` to remove those huge objects from the pack file
> that contains all of the other stuff.
Pardon the late response.

This method can work, but it is a manual process.  I am interested in a 
method where Git can make the determination for me based on a wildcard 
and flag from .gitattributes.

I am still playing with the feature within a multi-gigabyte repository 
with lots of large binaries.  I'll post more about it when some 
additional changes have been made.

Thanks!

Josh

