* unable to run gc (or git repack -Adl )
@ 2010-01-29 22:29 Jon Nelson
2010-01-30 2:14 ` Nicolas Pitre
0 siblings, 1 reply; 3+ messages in thread
From: Jon Nelson @ 2010-01-29 22:29 UTC
To: git
Using 1.6.4.2 on openSUSE 11.2 (x86_64).
I have a beefy repo (du of 14GB) that I can't seem to run 'gc' on.
After running for over 2 hours, this is what I get:
Counting objects: 267676, done.
Compressing objects: 100% (217424/217424), done.
fatal: Unable to create temporary file: Too many open files
error: failed to run repack
Ugh!
I have 3 GB of memory (and 1GB of swap).
When I strace the various processes, I see some things I don't understand:
1. I see the 'git-repack' shell process scanning for .keep files. I
don't have any. Is there a shortcut to this?
It's also hugely inefficient. In this case, the code to identify non
.keep packs takes *4 minutes, 45 seconds*, lots of disk I/O, and lots
of CPU (it pegs one CPU at 100% for the entire duration). With a wee
bit of awk, I have reduced that to 2.3 seconds with VASTLY reduced I/O
and CPU requirements. Patch attached.
2. When git-pack-objects is being run, around the time it's 85% done
"compressing" it's very very very slow. Like, 2-5 objects every
second. The largest object in the repo is about 1MB.
3. When git-pack-objects is running and counting up the number of
objects, it is stat'ing files that aren't in the working directory, and
should not be, according to the index. If I switch the repo to be a
"bare" repository, then it doesn't do that, however, why is it doing
that in the first place?
4. Should git-pack-objects be reading the pack.idx files for counting
objects instead of the .pack files themselves?
5. There is no 5
6. Should git-pack-objects be closing .pack files after opening them?
I have 6559 .pack files.
7. Ultimately, how do I get "git gc" to work on this repo?
diff --git a/git-repack.sh b/git-repack.sh
index 1eb3bca..4358f96 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -62,15 +62,7 @@ case ",$all_into_one," in
,t,)
args= existing=
if [ -d "$PACKDIR" ]; then
- for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
- | sed -e 's/^\.\///' -e 's/\.pack$//'`
- do
- if [ -e "$PACKDIR/$e.keep" ]; then
- : keep
- else
- existing="$existing $e"
- fi
- done
+ existing=$( find . -type f -name '*.pack' -o -name '*.pack.keep' | sed -e 's/^\.\///' | sort | awk '{ if ($0 ~ /\.keep$/) { N=substr($0, 0, length($0)-5); K[N]=0; } else { K[$0]=1; } } END { for (k in K) { if (K[k] == 1) { printf "%s ", k; } } } ' )
if test -n "$existing" -a -n "$unpack_unreachable" -a \
-n "$remove_redundant"
then
--
Jon
* Re: unable to run gc (or git repack -Adl )
2010-01-29 22:29 unable to run gc (or git repack -Adl ) Jon Nelson
@ 2010-01-30 2:14 ` Nicolas Pitre
2010-01-30 2:45 ` Jon Nelson
0 siblings, 1 reply; 3+ messages in thread
From: Nicolas Pitre @ 2010-01-30 2:14 UTC
To: Jon Nelson; +Cc: git
On Fri, 29 Jan 2010, Jon Nelson wrote:
> Using 1.6.4.2 on openSUSE 11.2 (x86_64).
>
> I have a beefy repo (du of 14GB) that I can't seem to run 'gc' on.
>
> After running for over 2 hours, this is what I get:
>
> Counting objects: 267676, done.
> Compressing objects: 100% (217424/217424), done.
> fatal: Unable to create temporary file: Too many open files
> error: failed to run repack
Ouch!! Impressive.
> Ugh!
Indeed.
> I have 3 GB of memory (and 1GB of swap).
> When I strace the various processes, I see some things I don't understand:
>
> 1. I see the 'git-repack' shell process scanning for .keep files. I
> don't have any. Is there a shortcut to this?
>
> It's also hugely inefficient. In this case, the code to identify non
> .keep packs takes *4 minutes, 45 seconds*, lots of disk I/O, and lots
> of CPU (it pegs one CPU at 100% for the entire duration). With a wee
> bit of awk, I have reduced that to 2.3 seconds with VASTLY reduced I/O
> and CPU requirements. Patch attached.
Your patch will pick any .pack file in the repo not only from the
.git/objects/pack directory. There is no such thing as *.pack.keep
either.
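A minimal sketch of a single-pass listing that addresses both points: it stays inside the pack directory and pairs each pack-X.pack with pack-X.keep. The function name list_unkept_packs is hypothetical and is not part of git-repack.sh; it prints base names without the extension, one per line, as the original loop collected them. No timing claim is made for it.

```shell
#!/bin/sh
# Hypothetical helper (not from git-repack.sh): print the base names of
# packs in directory "$1" that have no matching .keep file.
list_unkept_packs() {
    (
        # Restrict the search to the pack directory itself.
        cd "$1" 2>/dev/null || exit 0
        find . -type f \( -name '*.pack' -o -name '*.keep' \) |
        sed -e 's|^\./||' |
        awk '
            # ".keep" and ".pack" are both 5 characters long; awk strings
            # are 1-indexed, so start substr() at 1.
            /\.keep$/ { kept[substr($0, 1, length($0) - 5)] = 1; next }
            /\.pack$/ { packs[substr($0, 1, length($0) - 5)] = 1 }
            END { for (p in packs) if (!(p in kept)) print p }
        ' |
        sort
    )
}
```

Usage would look like `existing=$(list_unkept_packs "$PACKDIR" | tr '\n' ' ')`.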
> 2. When git-pack-objects is being run, around the time it's 85% done
> "compressing" it's very very very slow. Like, 2-5 objects every
> second. The largest object in the repo is about 1MB.
You probably consumed all RAM and started swapping at that point.
Or... you have many of those 1MB objects. If so try
using --window-memory=8M or similar.
> 3. When git-pack-objects is running and counting up the number of
> objects, it is stat'ing files that aren't in the working directory, and
> should not be, according to the index. If I switch the repo to be a
> "bare" repository, then it doesn't do that, however, why is it doing
> that in the first place?
A bare repository has no index. When the index is present though, it is
necessary to also pack objects it references. Why working directory
files would be stat()'d in that case I don't know.
> 4. Should git-pack-objects be reading the pack.idx files for counting
> objects instead of the .pack files themselves?
No. The whole point when "counting objects" is to perform a walk of the
history graph and capture the set of objects that are actually
referenced from your branches/tags and leave the unreferenced objects
behind. Also the order in which those objects are encountered during
that history walk is very important for efficient object placement in
the final pack. So this is much more involved than only listing the
objects contained in every pack.
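The difference between the two counts can be seen with stock commands. This is a rough sketch: the function names are made up for illustration, and both are assumed to run from inside the repository. The first walks history from all refs, the second only lists what the .idx files contain, reachable or not.

```shell
#!/bin/sh
# Hypothetical helper: count objects reachable from all refs
# (commits, trees, blobs, tags) by walking the history graph.
count_reachable_objects() {
    git rev-list --objects --all | wc -l | tr -d ' '
}

# Hypothetical helper: count entries across all pack index files.
# This is a flat listing with no reachability information.
count_indexed_objects() {
    packdir=$(git rev-parse --git-dir)/objects/pack
    total=0
    for idx in "$packdir"/*.idx; do
        [ -e "$idx" ] || continue      # no packs at all
        n=$(git show-index < "$idx" | wc -l)
        total=$((total + n))
    done
    echo "$total"
}
```

When the same object is stored in several packs, the second number counts it once per pack, so the two figures need not agree.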
> 5. There is no 5
I'm a flying bulldozer.
> 6. Should git-pack-objects be closing .pack files after opening them?
> I have 6559 .pack files.
No wonder you exhausted your file handles. And your repository must
be _horribly_ slow to work with, which might explain the
slowness/swappiness.
> 7. Ultimately, how do I get "git gc" to work on this repo?
... because you really really want to repack this mess ASAP of course.
Having so many packs means they must be relatively small. Yet, Git
allows up to 8GB of pack data to be mmap()'d at once on x86_64. This
means that an average of 3700 packs might be mapped at once, plus their
respective .idx files.
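One way to arrive at a figure near 3700, assuming the whole 14GB reported by du is pack data spread evenly over the 6559 packs (both simplifications), is to divide the 8GB limit by the average pack size. The function name below is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many average-sized packs fit in an
# mmap budget. Arguments: total pack data in KiB, number of packs,
# mmap limit in KiB.
packs_mappable() {
    total_kib=$1
    pack_count=$2
    limit_kib=$3
    avg_kib=$((total_kib / pack_count))   # average pack size, rounded down
    echo $((limit_kib / avg_kib))
}

# 14 GiB in 6559 packs against an 8 GiB limit.
packs_mappable $((14 * 1024 * 1024)) 6559 $((8 * 1024 * 1024))
```

With these inputs the average pack is about 2.2MB and the result is 3748, and each mapped pack also needs its .idx file open.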
You could try:
git config core.packedGitLimit 256m
git config core.packedGitWindowSize 32m
git config pack.deltaCacheSize 1
and try repacking again with 'git gc --prune=now'. After the repack
succeeds, you should be able to remove the above configs from your
.git/config file.
Nicolas
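The steps above can be sketched as a small script. The function name is hypothetical, and removing the settings only after a successful gc is one reading of the advice rather than a tested recipe for a repository of this size.

```shell
#!/bin/sh
# Hypothetical wrapper around the suggested settings. Run from inside
# the repository to be repacked.
repack_with_small_windows() {
    # Show how many packs exist before starting.
    git count-objects -v | grep '^packs:'

    # Temporarily limit how much pack data is mapped at once.
    git config core.packedGitLimit 256m
    git config core.packedGitWindowSize 32m
    git config pack.deltaCacheSize 1

    if git gc --prune=now; then
        # Repack succeeded: drop the temporary settings again.
        git config --unset core.packedGitLimit
        git config --unset core.packedGitWindowSize
        git config --unset pack.deltaCacheSize
    else
        echo "gc failed; temporary settings left in .git/config" >&2
        return 1
    fi
}
```

If gc fails, the settings stay in place so the run can be retried or tuned further.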
* Re: unable to run gc (or git repack -Adl )
2010-01-30 2:14 ` Nicolas Pitre
@ 2010-01-30 2:45 ` Jon Nelson
0 siblings, 0 replies; 3+ messages in thread
From: Jon Nelson @ 2010-01-30 2:45 UTC
Cc: git
On Fri, Jan 29, 2010 at 8:14 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 29 Jan 2010, Jon Nelson wrote:
...
>> 1. I see the 'git-repack' shell process scanning for .keep files. I
>> don't have any. Is there a shortcut to this?
>>
>> It's also hugely inefficient. In this case, the code to identify non
>> .keep packs takes *4 minutes, 45 seconds*, lots of disk I/O, and lots
>> of CPU (it pegs one CPU at 100% for the entire duration). With a wee
>> bit of awk, I have reduced that to 2.3 seconds with VASTLY reduced I/O
>> and CPU requirements. Patch attached.
>
> Your patch will pick any .pack file in the repo not only from the
> .git/objects/pack directory. There is no such thing as *.pack.keep
> either.
Ugh. Yep. Patch amended. Still fast. Still wrong?
>> 3. When git-pack-objects is running and counting up the number of
>> objects, it is stat'ing files that aren't in the working directory, and
>> should not be, according to the index. If I switch the repo to be a
>> "bare" repository, then it doesn't do that, however, why is it doing
>> that in the first place?
>
> A bare repository has no index. When the index is present though, it is
> necessary to also pack objects it references. Why working directory
> files would be stat()'d in that case I don't know.
Inquiring minds want to know.
>> 4. Should git-pack-objects be reading the pack.idx files for counting
>> objects instead of the .pack files themselves?
>
> No. The whole point when "counting objects" is to perform a walk of the
> history graph and capture the set of objects that are actually
> referenced from your branches/tags and leave the unreferenced objects
> behind. Also the order in which those objects are encountered during
> that history walk is very important for efficient object placement in
> the final pack. So this is much more involved than only listing the
> objects contained in every pack.
Ah. For some reason I thought the .idx files contained not just a
straight listing but also the parent/child relationships as well.
> You could try:
>
> git config core.packedGitLimit 256m
> git config core.packedGitWindowSize 32m
> git config pack.deltaCacheSize 1
>
> and try repacking again with 'git gc --prune=now'. After the repack
> succeeds, you should be able to remove the above configs from your
> .git/config file.
I have since thrown out the repo and started over on this particular
experiment, issuing a 'git gc' rather more often. The config options
above are now dutifully scribbled down. Thanks!
diff --git a/git-repack.sh b/git-repack.sh
index 1eb3bca..3cef57d 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -62,15 +62,7 @@ case ",$all_into_one," in
,t,)
args= existing=
if [ -d "$PACKDIR" ]; then
- for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
- | sed -e 's/^\.\///' -e 's/\.pack$//'`
- do
- if [ -e "$PACKDIR/$e.keep" ]; then
- : keep
- else
- existing="$existing $e"
- fi
- done
+ existing=$( cd "$PACKDIR" && find . -type f -name '*.pack' -o -name '*.keep' | sed -e 's/^\.\///' | sort | awk '{ if ($0 ~ /\.keep$/) { N=substr($0, 0, length($0)-4) "pack"; K[N]=0; } else { if ($0 in K) { } else { K[$0]=1; } } } END { for (k in K) { if (K[k] == 1) { printf "%s ", k; } } } ' )
if test -n "$existing" -a -n "$unpack_unreachable" -a \
-n "$remove_redundant"
then
--
Jon