* A naive proposal for preventing loose object explosions
From: mfick @ 2013-09-06 3:42 UTC (permalink / raw)
To: git; +Cc: nasserg
I am imagining what I consider to be a naive approach to preventing
loose unreachable object explosions. It may seem a bit heavy-handed
at first, but every conversation so far about this issue seems to
have died, so I am looking for a simple incremental improvement to
what we have today. I theorize that this approach will provide the
same protections (good and bad) against races as running
git-repack -A -d and git-prune --expire <time> regularly does today.
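For reference, the status-quo maintenance cycle being mimicked here
looks roughly like this (the expiry window is just an example):

  git repack -A -d                 # eject unreachable objects as loose
  git prune --expire=2.weeks.ago   # later, delete expired loose objects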
1a) Add a --prune-packed option to git-repack to force a call to git
prune-packed without having to specify the -d option to git-repack.
1b) Add a --keep <marker> option to git-repack which will create a
.keep file containing "marker" for each existing pack file that was
repacked (but not for the new pack).
1c) Now, instead of running:

  git-repack -A -d

run:

  git-repack --prune-packed --keep 'prune-when-expired'
This should effectively keep a duplicate copy of all the old pack
files around, but the new pack file will not have unreferenced
objects in it. This is similar to having unreachable loose objects
left around, except that it also keeps extra copies of reachable
objects, wasting some disk space. While this will normally consume
more disk space in pack files, it will not explode loose objects,
which will likely save a lot of space when such explosions would
have occurred. Of course, this should also prevent the severe
performance downsides of these explosions. Object lookups should
likely not get any slower than if repack were not run, and the extra
new pack might actually help find some objects quicker. Safety with
respect to unreachable object race conditions should be the same as
with git repack -A -d, since at least one copy of every object
should be kept around during this run? A rough shell emulation of
1a-1c is sketched below.
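A minimal sketch, using only stock git commands; the proposed
--prune-packed and --keep options do not exist, so the marker .keep
files are written by hand:

  #!/bin/sh
  # Emulate "git-repack --prune-packed --keep <marker>" (sketch only).
  marker=${1:-prune-when-expired}
  packdir=$(git rev-parse --git-dir)/objects/pack

  # Remember which packs existed before repacking.
  old_packs=$(ls "$packdir"/pack-*.pack 2>/dev/null)

  # Pack reachable objects into one new pack; without -d, the old
  # packs (and their unreachable objects) are left on disk.
  git repack -A

  # Mark every pre-existing pack as kept-until-expiry.
  for p in $old_packs; do
      keep=${p%.pack}.keep
      test -f "$keep" || echo "$marker" >"$keep"
  done

  # Drop loose objects that are now duplicated in packs.
  git prune-packed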
Then:
2a) Add support for passing in a list of pack files to git-repack.
This list will then be used as the original "existing" list instead
of finding all pack files without .keep files.
2b) Add an --expire-marked <marker> option to git-prune which will
find any pack files whose .keep contains "marker" and evaluate
whether each meets the --expire time. If so, it will also call:

  git-repack -a -d <expired-pack-files>...
This should repack any reachable objects from the
<expired-pack-files> into a single new pack file. This may again
cause some reachable object duplication (likely with the same
performance effects as the first git-repack phase above), but
unreachable objects from the <expired-pack-files> will now have been
pruned, just as they would have been had they originally been turned
into loose objects. A rough sketch of this expiry pass follows below.
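A minimal sketch, reusing the marker convention from the previous
script; --expire-marked and the pack-list argument to git-repack (2a)
do not exist, so expiry is approximated with find -mtime and a full
repack:

  #!/bin/sh
  # Emulate "git-prune --expire-marked <marker>" (sketch only).
  marker=${1:-prune-when-expired}
  days=${2:-14}
  packdir=$(git rev-parse --git-dir)/objects/pack

  expired=
  for keep in "$packdir"/pack-*.keep; do
      test -f "$keep" || continue
      grep -qx "$marker" "$keep" || continue
      # Treat the .keep mtime as the time the pack was retired.
      if test -n "$(find "$keep" -mtime +"$days")"; then
          expired="$expired $keep"
      fi
  done

  test -n "$expired" || exit 0

  # Removing the .keep files lets "repack -a -d" delete these packs,
  # discarding their unreachable objects; the proposed 2a interface
  # would repack only the expired packs instead of everything.
  rm -f $expired
  git repack -a -d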
3) Finally on the next repack cycle the current duplicated reachable
objects should likely get fully reconsolidated into a single copy.
Does this sound like it would work? I may attempt to construct this
for internal use (since it is a bit hacky). It feels like it could be
done mostly with some simple shell modding/wrapping (which feels less
scary than messing with the core C tools). I wonder if I am missing
some obvious flaw in this approach?
Thanks for any insights,
-Martin
* Re: A naive proposal for preventing loose object explosions
From: Junio C Hamano @ 2013-09-06 17:19 UTC (permalink / raw)
To: mfick; +Cc: git, nasserg
mfick@codeaurora.org writes:
> Object lookups should likely not get any slower than if
> repack were not run, and the extra new pack might actually help
> find some objects quicker.
In general, having an extra pack, only to keep objects that you know
are available in other packs, will make _all_ object accesses, not
just the ones that are contained in that extra pack, slower.
Instead of mmapping all the .idx files for all the available
packfiles, we could build a table that records, for each packed
object, from which packfile at what offset the data is available to
optimize the access, but obviously building that in-core table will
take time, so it may not be a good trade-off to do so at runtime (a
precomputed super-.idx that we can mmap at runtime might be a good
way forward if that turns out to be the case).
> Does this sound like it would work?
Sorry, but it is unclear what problem you are trying to solve.
Is it that you do not like that "repack -A" ejects unreferenced
objects and makes them loose, of which you may have many?
The loosen_unused_packed_objects() function used by "repack -A"
calls the force_object_loose() function (actually, it is the sole
caller of the function). If you tweak the latter to stream to a
single new "graveyard" packfile and mark it as "kept until expiry",
would it solve the issue the same way but with much smaller impact?
There already is an infrastructure available to open a single output
packfile and send multiple objects to it in bulk-checkin.c, and I am
wondering if you can take advantage of that framework. The existing
interface to it assumes that the object data is coming from a file
descriptor (the interface was built to support bulk-checkin of many
objects in an empty repository), and it needs refactoring to allow
stream_to_pack() to take different kinds of data sources in the form
of a stateful callback function, though.
* Re: A naive proposal for preventing loose object explosions
From: Martin Fick @ 2013-09-06 18:12 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, nasserg
On Friday, September 06, 2013 11:19:02 am Junio C Hamano wrote:
> mfick@codeaurora.org writes:
> > Object lookups should likely not get any slower than if
> > repack were not run, and the extra new pack might
> > actually help find some objects quicker.
>
> In general, having an extra pack, only to keep objects
> that you know are available in other packs, will make
> _all_ object accesses, not just the ones that are
> contained in that extra pack, slower.
My assumption was that if the new pack, with all the consolidated
reachable objects in it, happens to be searched first, it would
actually speed things up. And if it is searched last, then the
objects weren't in the other packs, so how could it have made the
lookup slower? It seems this would only slow down the missing-object
path?
But it sounds like all the index files are mmapped up front? Then
yes, I can see how it would slow things down. However, it is only
one extra (hopefully now well-optimized) pack. My base assumption
was that even if it does slow things down, it would likely be
unmeasurable and a price worth paying to avoid an extreme penalty.
> Instead of mmapping all the .idx files for all the
> available packfiles, we could build a table that
> records, for each packed object, from which packfile at
> what offset the data is available to optimize the
> access, but obviously building that in-core table will
> take time, so it may not be a good trade-off to do so at
> runtime (a precomputed super-.idx that we can mmap at
> runtime might be a good way forward if that turns out to
> be the case).
>
> > Does this sound like it would work?
>
> Sorry, but it is unclear what problem you are trying to
> solve.
I think you guessed it below, I am trying to prevent loose
object explosions by keeping unreachable objects around in
packs (instead of loose) until expiry. With the current way
that pack-objects works, this is the best I could come up
with (I said naive). :(
Today git-repack calls git pack-objects like this:

  git pack-objects --keep-true-parents --honor-pack-keep \
      --non-empty --all --reflog $args </dev/null "$PACKTMP"
This has no mechanism to place unreachable objects in a
pack. If git pack-objects supported an option which
streamed them to a separate file (as you suggest below),
that would likely be the main piece needed to avoid the
heavy-handed approach I was suggesting.
The problem is how to define the interface for this? How do
we get the filename of the new unreachable packfile? Today
the name of the new packfile is sent to stdout, would we
just tack on another name? That seems like it would break
some assumptions? Maybe it would be OK if it only did that
when an --unreachable flag was added? Then git-repack could
be enhanced to understand that flag and the extra filenames
it outputs?
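For concreteness, a purely hypothetical interface (none of this
exists today) might look like the following, with a second pack name
emitted on stdout only when the new flag is given:

  $ git pack-objects --all --reflog --unreachable=$PACKTMP-cruft \
        </dev/null $PACKTMP
  <name-of-reachable-pack>      # printed on stdout, as today
  <name-of-unreachable-pack>    # extra line, only with --unreachable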
> Is it that you do not like that "repack -A" ejects
> unreferenced objects and makes them loose, of which you
> may have many?
Yes, several times a week we have people pushing the kernel to the
wrong projects, which leads to 4M loose objects. :( Without a
solution for this regular problem, we are very scared to move our
repos off of SSDs, since these explosions lead to hour-plus-long
fetches.
> The loosen_unused_packed_objects() function used by
> "repack -A" calls the force_object_loose() function
> (actually, it is the sole caller of the function). If
> you tweak the latter to stream to a single new
> "graveyard" packfile and mark it as "kept until expiry",
> would it solve the issue the same way but with much
> smaller impact?
Yes.
> There already is an infrastructure available to open a
> single output packfile and send multiple objects to it
> in bulk-checkin.c, and I am wondering if you can take
> advantage of that framework. The existing interface to
> it assumes that the object data is coming from a file
> descriptor (the interface was built to support
> bulk-checkin of many objects in an empty repository),
> and it needs refactoring to allow stream_to_pack() to
> take different kinds of data sources in the form of a
> stateful callback function, though.
That feels beyond what I can currently dedicate the time to. Like I
said, my solution is heavy-handed, but it felt simple enough for me
to try. I can spare the extra disk space, and I am not convinced the
performance hit would be bad. I would, of course, be delighted if
someone else were to do what you suggest, but I get that it's my
itch...
-Martin
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation