* git gc does not clean tmp_pack* files
From: Boomman @ 2024-12-19 2:19 UTC (permalink / raw)
To: git
Hello,

I'm hitting an issue trying to garbage-collect a git repo when disk
space is low.

I ran "git gc" several times, freeing up more and more disk space
between runs, not realizing that when "git gc" fails it just leaves
its tmp_pack file behind.

D:\Platform>git gc
...blah...
fatal: sha1 file '.git/objects/pack/tmp_pack_FG1inp' write error. Out of diskspace
fatal: failed to run repack

D:\Platform>git gc
...blah...
fatal: sha1 file '.git/objects/pack/tmp_pack_IFvamY' write error. Out of diskspace
fatal: failed to run repack

D:\Platform>git gc
...blah...
fatal: sha1 file '.git/objects/pack/tmp_pack_khHCC9' write error. Out of diskspace
fatal: failed to run repack

D:\Platform>dir .git\objects\pack\tmp*
Directory of D:\Platform\.git\objects\pack
12/18/2024 05:33 PM 7,367,032,832 tmp_pack_FG1inp
12/18/2024 05:35 PM 3,787,194,368 tmp_pack_IFvamY
12/18/2024 05:39 PM 7,713,062,912 tmp_pack_khHCC9
09/11/2024 11:33 AM 3,068,002,304 tmp_pack_XTVFUi
4 File(s) 21,935,292,416 bytes
0 Dir(s) 339,968 bytes free
I believe that before trying to write *anything* to disk "git gc"
should try to take exclusive handles on these and wipe them, ideally
by default. The total size of these tmp* files is multiple times
larger than the repo I'm trying to compact, so if the command just did
this pre-cleaning I'd not have hit this problem once I cleaned enough
disk space.
Please let me know your thoughts on this.
-Vitaly
* Re: git gc does not clean tmp_pack* files
From: Jeff King @ 2024-12-19 5:42 UTC (permalink / raw)
To: Boomman; +Cc: git
On Wed, Dec 18, 2024 at 06:19:06PM -0800, Boomman wrote:
> D:\Platform>dir .git\objects\pack\tmp*
> Directory of D:\Platform\.git\objects\pack
>
> 12/18/2024 05:33 PM 7,367,032,832 tmp_pack_FG1inp
> 12/18/2024 05:35 PM 3,787,194,368 tmp_pack_IFvamY
> 12/18/2024 05:39 PM 7,713,062,912 tmp_pack_khHCC9
> 09/11/2024 11:33 AM 3,068,002,304 tmp_pack_XTVFUi
> 4 File(s) 21,935,292,416 bytes
> 0 Dir(s) 339,968 bytes free
>
> I believe that before trying to write *anything* to disk "git gc"
> should try to take exclusive handles on these and wipe them, ideally
> by default. The total size of these tmp* files is multiple times
> larger than the repo I'm trying to compact, so if the command just did
> this pre-cleaning I'd not have hit this problem once I cleaned enough
> disk space.

git-gc does know how to clean up these files, but they are subject to
the same mtime grace period that loose objects are. This is to avoid
deleting a file that is being actively used by a simultaneous process.

Try "git gc --prune=now" if you know there are no other active processes
in the repository.
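
As a rough illustration of the grace period described above, here is a
Python sketch. It is not how git-prune is implemented; the function
name stale_tempfiles and its parameters are made up for this example.
It only selects tmp_pack_* files whose mtime is older than a cutoff, on
the assumption that a recently modified file may still belong to a live
writer.

```python
import time
from pathlib import Path


def stale_tempfiles(pack_dir, grace_seconds, now=None):
    """Return tmp_pack_* files whose mtime is older than the grace period."""
    now = time.time() if now is None else now
    cutoff = now - grace_seconds
    stale = []
    for path in sorted(Path(pack_dir).glob("tmp_pack_*")):
        # A recently modified file may still be in use by another process,
        # so only files older than the cutoff are treated as leftovers.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale
```

With a grace period of zero, every leftover file written before the
call qualifies, which is roughly the effect that --prune=now asks for.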

We usually prune things after finishing the repack. So if you're running
out of disk space to repack, there might be a chicken-and-egg problem.
You can run "git prune" manually in that case.

Possibly git-gc should prune first for this reason, but I'd be hesitant
to do so for actual loose objects. It's a little weird that tempfile
cleanup is lumped in with loose object cleanup, and is mostly
historical. Possibly those should be split.

-Peff
* Re: git gc does not clean tmp_pack* files
From: Boomman @ 2024-12-19 8:26 UTC (permalink / raw)
To: Jeff King; +Cc: git
Yes, if the behavior in case of running out of disk space is to just
leave the malformed file there, it stands to reason that cleaning up
those malformed files should be the first operation to do for gc.
At the very least, git should notify the user that they've got all of
those tmp_pack files totaling 20+ GB in the object folder before it
will declare that it can't write a single byte into a lock file
because previous "git gc" calls exhausted all the disk space.
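
A minimal sketch of such a notification, in Python for illustration
only. Nothing here is existing Git behavior: summarize_tempfiles,
tempfile_warning, and the wording of the message are hypothetical.

```python
from pathlib import Path


def summarize_tempfiles(pack_dir):
    """Return (count, total_bytes) for tmp_pack_* files in pack_dir."""
    sizes = [
        p.stat().st_size
        for p in Path(pack_dir).glob("tmp_pack_*")
        if p.is_file()
    ]
    return len(sizes), sum(sizes)


def tempfile_warning(pack_dir):
    """Build a warning line, or return None when nothing is left over."""
    count, total = summarize_tempfiles(pack_dir)
    if count == 0:
        return None
    # Hypothetical wording for this example, not an actual Git message.
    return f"warning: {count} leftover tmp_pack_* file(s) using {total:,} bytes"
```

Printing a line like this before the repack starts would at least point
the user at the pack directory when the disk is nearly full.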

I know that on Windows it's possible to take an exclusive write lock
on a file while the process is running, so at least on Windows those
tmp_pack files could be "soft-try" cleaned up without affecting other
running git processes. I'm not sure whether that is possible on the
other supported OSes.

https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex

-Vitaly
* Re: git gc does not clean tmp_pack* files
From: Junio C Hamano @ 2024-12-19 11:17 UTC (permalink / raw)
To: Boomman; +Cc: Jeff King, git
Boomman <boomman37@gmail.com> writes:
> Yes, if the behavior in case of running out of disk space is to just
> leave the malformed file there, it stands to reason that cleaning up
> those malformed files should be the first operation to do for gc.

It is misleading to call them malformed, isn't it? When a Git
process creates a packfile (or loose object file for that matter),
it is written under one of these tmp_* names. When the process dies
without finalizing it (either removing it or renaming it to its
final name), the file is left behind, and it would be better if we
could remove it _before_ another process wants to consume more disk
space.

But the issue is how you tell which of these "malformed" files
are still being written and will be finalized, and which ones are
leftovers. You want to remove the latter without molesting the
former. And you want to do so in a portable way, possibly even
across network file systems.

I guess, as Peff alluded to, we could start "gc" by pruning only those
"possibly in progress, possibly leftover" files that are too old, then
repack the clearly "finalized" ones, and then prune again, this time
including the "finalized" ones that are no longer in use. That would
keep the creation of new packfiles from being blocked by leftover
files that are hoarding disk quota.
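
The effect of that ordering can be shown with a toy model in Python.
This is only a sketch with made-up numbers; repack_succeeds and its
parameters are hypothetical and do not correspond to any Git code.

```python
def repack_succeeds(free_bytes, stale_tmp_bytes, repack_needs,
                    prune_tempfiles_first):
    """Toy model: does the repack step have enough free space?"""
    if prune_tempfiles_first:
        # Step 1: remove old leftover temp files before writing anything.
        free_bytes += stale_tmp_bytes
    # Step 2: the repack needs room for the new pack. If it fails, the
    # prune that normally runs after a repack is never reached.
    return repack_needs <= free_bytes


# Made-up numbers: 1 unit free, 20 units held by stale temp files,
# and a repack that needs 7 units.
print(repack_succeeds(1, 20, 7, prune_tempfiles_first=False))
print(repack_succeeds(1, 20, 7, prune_tempfiles_first=True))
```

The first call prints False and the second prints True: with the
current order the space held by leftovers is never reclaimed, because
the step that would reclaim it only runs after a successful repack.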
* Re: git gc does not clean tmp_pack* files
From: Jeff King @ 2024-12-20 9:05 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Boomman, git
On Thu, Dec 19, 2024 at 03:17:01AM -0800, Junio C Hamano wrote:
> Boomman <boomman37@gmail.com> writes:
>
> > Yes, if the behavior in case of running out of disk space is to just
> > leave the malformed file there, it stands to reason that cleaning up
> > those malformed files should be the first operation to do for gc.
>
> It is misleading to call them malformed, isn't it? When a Git
> process creates a packfile (or loose object file for that matter),
> it is written under one of these tmp_* names. When the process dies
> without finalizing it (either removing it or renaming it to its
> final name), the file is left behind, and it would be better if we
> could remove it _before_ another process wants to consume more disk
> space.

We usually automatically clean up our tempfiles if we encounter an
error, but don't do so for partially written packs. I think this is
mostly historical, though occasionally it can be useful for debugging
(e.g., indexing a pack coming over the network).

It might make sense to register them as tempfiles in the usual way,
possibly with an environment variable option to ask for them to be kept
(for debugging).

That's not foolproof, since a process can die without cleaning up after
itself (e.g., on system crash). But it would mean that a repeatedly
failing "git repack -ad" does not fill up the disk. And the decision of
when to clean up tempfiles in git-gc is less important.
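
The cleanup-on-error idea can be sketched in Python as a context
manager. This is an illustration under stated assumptions, not Git's
tempfile code: temp_pack is a made-up helper, and the environment
variable EXAMPLE_KEEP_TMP_PACKS is a hypothetical name standing in for
the "keep for debugging" option.

```python
import os
import tempfile
from contextlib import contextmanager

# Hypothetical variable name for this sketch; not a real Git setting.
KEEP_ENV = "EXAMPLE_KEEP_TMP_PACKS"


@contextmanager
def temp_pack(pack_dir, final_name):
    """Write to a tmp_pack_* file; rename on success, remove on error."""
    fd, tmp_path = tempfile.mkstemp(prefix="tmp_pack_", dir=pack_dir)
    try:
        with os.fdopen(fd, "wb") as out:
            yield out
        # Only a fully written file is moved to its final name.
        os.replace(tmp_path, os.path.join(pack_dir, final_name))
    except BaseException:
        # On any failure, drop the partial file unless asked to keep it.
        if not os.environ.get(KEEP_ENV):
            os.unlink(tmp_path)
        raise
```

A write that fails inside the "with temp_pack(...)" block leaves
nothing behind, so repeated failures do not accumulate. A crash or
power loss still skips the cleanup, which is the non-foolproof case
mentioned above.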

> But the issue is how you tell which of these "malformed" files
> are still being written and will be finalized, and which ones are
> leftovers. You want to remove the latter without molesting the
> former. And you want to do so in a portable way, possibly even
> across network file systems.

Yeah, I think there are two issues being discussed in this thread:

  - when to clean up leftover tempfiles

  - how to decide which tempfiles are leftover

The second one is what the OP mentioned for locking. But not only does
that have portability questions, I'm not sure it is sufficient. Would we
ever write tmp_pack_*, complete our process, and then expect our caller
to do something with it (meaning there's a race where no process is
holding the lock)?

I'm not sure. We definitely write "tmp" packfiles via pack-objects and
expect git-repack to move them to their final names. I think we use a
slightly different name ("tmp-<pid>-pack-*"), but arguably we should
consider cleaning up stale versions of those, too.

-Peff
* Re: git gc does not clean tmp_pack* files
From: Boomman @ 2024-12-21 1:17 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
For me, two "git gc" runs on the same repo cannot run at the same time:

fatal: gc is already running on machine 'WIN-blah' pid 40304 (use --force if not)

If you're already colliding on this, then I don't see why you can't
use a normal-looking name such as "tmp_garbagecollecting", without a
randomized string, so that each execution would at least overwrite
the same location. In this case --force could probably append _1.

-Vitaly
* Re: git gc does not clean tmp_pack* files
From: Jeff King @ 2024-12-28 19:44 UTC (permalink / raw)
To: Boomman; +Cc: Junio C Hamano, git
On Fri, Dec 20, 2024 at 05:17:50PM -0800, Boomman wrote:
> For me, two "git gc" runs on the same repo cannot run at the same time:
>
> fatal: gc is already running on machine 'WIN-blah' pid 40304 (use --force if not)
>
> If you're already colliding on this, then I don't see why you can't
> use a normal-looking name such as "tmp_garbagecollecting", without a
> randomized string, so that each execution would at least overwrite
> the same location. In this case --force could probably append _1.

git-gc is not the only thing that writes packs. There might be
simultaneous packs written by incoming pushes or fetches, for example.
-Peff
* Re: git gc does not clean tmp_pack* files
From: Boomman @ 2024-12-28 20:13 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
Right, but you know the *intent* behind each pack being created, right?
This is push, this is pull, this is gc. Clearly some of those, like
gc, are not expected to run in parallel in normal scenarios. I
imagine there are more: fetch from a single remote, push to a single
remote? Why not make their pack names contain an operation identifier
that's supposed to be unique by default, instead of a random string:
tmp_fetch_myremote, tmp_push_myotherremote. This way you can reduce
the amount of trash from failed operations, since the same location
is going to be overwritten on each call, plus the folder contents
will be much more comprehensible at a glance.
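
A small Python sketch of the naming scheme proposed here. It is purely
illustrative: temp_pack_name and write_temp_pack are hypothetical
helpers, and Git does not name its temporary packs this way.

```python
import os


def temp_pack_name(operation, remote=None):
    """Build a fixed temp name, e.g. tmp_fetch_myremote."""
    parts = ["tmp", operation]
    if remote is not None:
        parts.append(remote)
    return "_".join(parts)


def write_temp_pack(pack_dir, operation, remote, data):
    """Write data under the fixed name, truncating any earlier attempt."""
    path = os.path.join(pack_dir, temp_pack_name(operation, remote))
    with open(path, "wb") as out:  # "wb" overwrites a leftover file
        out.write(data)
    return path
```

The overwrite only helps if two operations with the same identifier
never run at the same time; concurrent writers would clobber each
other's file.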
-Vitaly