* Leaving large binaries out of the packfile
From: Joshua Jensen @ 2010-06-10 6:25 UTC
To: git@vger.kernel.org
Hi.
I've been dealing with a Subversion repository that contains a lot of
large binaries. Git generally seems to handle them reasonably enough,
although it chokes under the pressure of a 'git gc' with this git-svn
repository. The repository's packs total 2.7 gigabytes, and as it turns
out, the 250 individual blob revisions' worth of large binaries account
for about 2.4 gigabytes of that.
Sometimes 'git gc' runs out of memory, and I have to track down which
file is causing the problem so I can add it to .gitattributes with a
'-delta' flag. Mostly, though, the repacking takes forever, and I dread
running the operation.
As an experiment, I added a '-pack' flag to .gitattributes. Files
matching an entry with this flag are left loose in the repository.
During a 'git gc', instead of recopying gigabytes of data into a new
pack each time, the loose objects are simply left in place. The 'git
gc' process runs very quickly with this change.
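For illustration, an entry combining the two flags might look like this
(the pattern is made up; match whatever your big binaries are):

        # big binaries: never delta-compress, and with this patch,
        # never copy into a pack
        *.dat -delta -pack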
The only issue I've found is in too_many_loose_objects(). gitk is
always telling me the repository needs to be packed, obviously because
of all the loose objects.
I haven't yet come up with a good idea for handling this. I thought
about putting the forced loose objects in a separate directory. (This
idea goes along with another that I want to build on top of this
functionality, the ability to commit and have -pack binaries go to an
alternates location.) I have also thought about writing out a file with
the count of forced loose objects and using it to drive down the
guesstimate made by too_many_loose_objects().
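To make that last idea concrete, a rough sketch (the
objects/info/forced-loose file and the helper are hypothetical, not
part of the patch below):

        /* Sketch only: read back the number of objects deliberately
         * left loose by '-pack', so too_many_loose_objects() in
         * builtin/gc.c could subtract it from its estimate. */
        static int forced_loose_count(void)
        {
                FILE *fp = fopen(git_path("objects/info/forced-loose"), "r");
                int count = 0;

                if (!fp)
                        return 0;
                if (fscanf(fp, "%d", &count) != 1)
                        count = 0;
                fclose(fp);
                return count;
        }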
Does anyone have any thoughts?
Thanks!
Josh
---
builtin/pack-objects.c | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 214d7ef..f33a7fb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -644,6 +644,28 @@ static int no_try_delta(const char *path)
         return 0;
 }
 
+static void setup_pack_attr_check(struct git_attr_check *check)
+{
+        static struct git_attr *attr_pack;
+
+        if (!attr_pack)
+                attr_pack = git_attr("pack");
+
+        check[0].attr = attr_pack;
+}
+
+static int must_pack(const char *path)
+{
+        struct git_attr_check check[1];
+
+        setup_pack_attr_check(check);
+        if (git_checkattr(path, ARRAY_SIZE(check), check))
+                return 1;
+        if (ATTR_FALSE(check->value))
+                return 0;
+        return 1;
+}
+
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
                             const char *name, int exclude)
 {
@@ -667,6 +689,9 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
         if (!exclude && local && has_loose_object_nonlocal(sha1))
                 return 0;
 
+        if (name && !must_pack(name))
+                return 0;
+
         for (p = packed_git; p; p = p->next) {
                 off_t offset = find_pack_entry_one(sha1, p);
                 if (offset) {
--
1.7.1.msysgit.3.1.g108b5.dirty
* Re: Leaving large binaries out of the packfile
From: Shawn O. Pearce @ 2010-06-10 18:04 UTC
To: Joshua Jensen; +Cc: git@vger.kernel.org
Joshua Jensen <jjensen@workspacewhiz.com> wrote:
> Sometimes, 'git gc' runs out of memory. I have to discover which file
> is causing the problem, so I can add it to .gitattributes with a
> '-delta' flag. Mostly, though, the repacking takes forever, and I dread
> running the operation.
If you have the list of big objects, you can put them into their
own pack file manually. Feed their SHA-1 names on stdin to git
pack-objects, and save the resulting pack under .git/objects/pack.
Assuming the pack was called pack-DEADC0FFEE.pack, create a file
called pack-DEADC0FFEE.keep in the same directory. This will stop
Git from trying to repack the contents of that pack file.
Now run `git gc` to remove those huge objects from the pack file
that contains all of the other stuff.
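Untested, but the whole dance would go something like this (the
'*.bin' pattern is only an example):

        # list the big blobs, names and all, as pack-objects input
        git rev-list --objects HEAD | grep '\.bin$' >/tmp/big-objects

        # write them into their own pack; pack-objects prints the hash
        # it uses in the pack's file name
        name=$(git pack-objects .git/objects/pack/pack </tmp/big-objects)

        # the .keep file tells repack/gc to leave this pack alone
        touch ".git/objects/pack/pack-$name.keep"

        # everything else now gets repacked without the huge blobs
        git gc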
--
Shawn.
* Re: Leaving large binaries out of the packfile
From: Paolo Bonzini @ 2010-06-11 15:29 UTC
To: Shawn O. Pearce; +Cc: Joshua Jensen, git@vger.kernel.org
On 06/10/2010 08:04 PM, Shawn O. Pearce wrote:
> Joshua Jensen<jjensen@workspacewhiz.com> wrote:
>> Sometimes, 'git gc' runs out of memory. I have to discover which file
>> is causing the problem, so I can add it to .gitattributes with a
>> '-delta' flag. Mostly, though, the repacking takes forever, and I dread
>> running the operation.
>
> If you have the list of big objects, you can put them into their
> own pack file manually. Feed their SHA-1 names on stdin to git
> pack-objects, and save the resulting pack under .git/objects/pack.
Do you know any simpler way than
  git log --pretty=format:%H | while read x; do
    git ls-tree $x -- ChangeLog | awk '{print $3}'
  done | sort -u
to do this? I thought it would be nice to add --sha1-only to
git-ls-tree, but maybe I'm missing some other trick.
> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
> called pack-DEADC0FFEE.keep in the same directory. This will stop
> Git from trying to repack the contents of that pack file.
>
> Now run `git gc` to remove those huge objects from the pack file
> that contains all of the other stuff.
That obviously wouldn't help if these large binaries are updated often,
however.
Paolo
* Re: Leaving large binaries out of the packfile
From: Shawn O. Pearce @ 2010-06-11 16:17 UTC
To: Paolo Bonzini; +Cc: Joshua Jensen, git@vger.kernel.org
Paolo Bonzini <bonzini@gnu.org> wrote:
> On 06/10/2010 08:04 PM, Shawn O. Pearce wrote:
>> Joshua Jensen<jjensen@workspacewhiz.com> wrote:
>>> Sometimes, 'git gc' runs out of memory. I have to discover which file
>>> is causing the problem, so I can add it to .gitattributes with a
>>> '-delta' flag. Mostly, though, the repacking takes forever, and I dread
>>> running the operation.
>>
>> If you have the list of big objects, you can put them into their
>> own pack file manually. Feed their SHA-1 names on stdin to git
>> pack-objects, and save the resulting pack under .git/objects/pack.
>
> Do you know any simpler way than
>
>   git log --pretty=format:%H | while read x; do
>     git ls-tree $x -- ChangeLog | awk '{print $3}'
>   done | sort -u
>
> to do this? I thought it would be nice to add --sha1-only to
> git-ls-tree, but maybe I'm missing some other trick.
Maybe

  git rev-list --objects HEAD | grep ' ChangeLog'

pack-objects wants the output of rev-list --objects as input, file
name and all. So it's just a matter of selecting the right lines
from its output.
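In other words (untested), the whole thing could collapse down to:

        git rev-list --objects HEAD |
        grep ' ChangeLog' |
        git pack-objects .git/objects/pack/pack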
>> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
>> called pack-DEADC0FFEE.keep in the same directory. This will stop
>> Git from trying to repack the contents of that pack file.
>>
>> Now run `git gc` to remove those huge objects from the pack file
>> that contains all of the other stuff.
>
> That obviously wouldn't help if these large binaries are updated often,
> however.
No, it doesn't. But you could still do this on a periodic basis.
That way you only drag around a handful of recently created large
binaries during a typical `git gc`, and not the entire project's
history of them.
--
Shawn.
* Re: Leaving large binaries out of the packfile
From: Joshua Jensen @ 2010-06-24 6:32 UTC
To: Shawn O. Pearce; +Cc: git@vger.kernel.org
----- Original Message -----
From: Shawn O. Pearce
Date: 6/10/2010 12:04 PM
> Joshua Jensen<jjensen@workspacewhiz.com> wrote:
>> Sometimes, 'git gc' runs out of memory. I have to discover which file
>> is causing the problem, so I can add it to .gitattributes with a
>> '-delta' flag. Mostly, though, the repacking takes forever, and I dread
>> running the operation.
> If you have the list of big objects, you can put them into their
> own pack file manually. Feed their SHA-1 names on stdin to git
> pack-objects, and save the resulting pack under .git/objects/pack.
>
> Assuming the pack was called pack-DEADC0FFEE.pack, create a file
> called pack-DEADC0FFEE.keep in the same directory. This will stop
> Git from trying to repack the contents of that pack file.
>
> Now run `git gc` to remove those huge objects from the pack file
> that contains all of the other stuff.
Pardon the late response.
This method can work, but it is a manual process. I am interested in
an approach where Git makes the determination for me, based on a
wildcard pattern and flag in .gitattributes.
I am still playing with the feature within a multi-gigabyte repository
with lots of large binaries. I'll post more about it when some
additional changes have been made.
Thanks!
Josh