From: Sam Vilain <sam@vilain.net>
To: Dana How <danahow@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH] [RFC] Generational repacking
Date: Thu, 07 Jun 2007 14:28:20 +1200
Message-ID: <46676D44.7070703@vilain.net>
In-Reply-To: <56b7f5510706061704r34692c49v994ff368bbc12d05@mail.gmail.com>
Dana How wrote:
> This patch complicates git-repack.sh quite a bit and
> I'm unclear on what _problem_ you're addressing.
The problem is simple, though it is partly in the eye of the beholder:
  1. Without repacking, you get a lot of loose objects, which means
     - unnecessary disk space usage
     - bad performance on many OSes
  2. repack takes too long to run very regularly; it's an occasional
     command.
  3. There is a perception that git repositories are not maintenance
     free.
What I'm aiming for is something which is light enough that it might
even win back the performance loss you got from 1), and which solves
the perception problem of 3).
Much as users who don't like automatic database maintenance turn it off
and run it at the best time, advanced git users will want to disable
this feature in ~/.gitconfig and run repack themselves when it suits
them, or via cron or whatever. Or it's disabled by default and users who
whine get told to turn it on; it really doesn't matter. I can already
do it with a commit hook, so I'm quite happy.
> The recent LRU preferred pack patch
> reduces much of the value in keeping a repository tidy
> ("tidy" == "few pack files").
Great, that is a good thing.
Conceptually, pack files are almost indistinguishable from database
partitions. Seen that way, the scaling problems that come with lots of
partitions can certainly be managed.
For instance with database partitioning you would expect your query
planner (in this case, read_packed_sha1()) to be able to select the
right partition (pack) to go to first to avoid excessive index lookups.
That git now has a strategy for picking the best pack quickly N% of the
time is an excellent way to reduce the impact of a large number of pack
files. Even so, I think you would probably still find measurable wins by
keeping the total number of packs limited.
Consider that I'm thinking of running this generational repack from
somewhere such as a commit hook, whenever it finds more than 100 loose
objects, so that the first generation repack is very quick and doesn't
annoy me - and the second generation will similarly be fairly quick, as
many deltas will already be computed. The exact behaviour will probably
require tuning to get a
good balance between good delta computation and minimal interruption to
commit flow. Someone on IRC floated the idea of making the first
generation do no delta computation to make it lightning fast.
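
A minimal sketch of the trigger described above, as it might be called
from a commit hook. This is not the code from the patch. The function
names and the injectable `run` parameter are made up for the example,
and it assumes SHA-1 object names, i.e. loose objects stored as
objects/xx/ followed by 38 hex characters.

```python
import os
import subprocess

HEX = set("0123456789abcdef")
LOOSE_THRESHOLD = 100  # the ">100 loose objects" figure from the message


def count_loose_objects(objects_dir: str) -> int:
    """Count loose objects: files named objects/xx/<38 hex characters>."""
    count = 0
    for fanout in os.listdir(objects_dir):
        subdir = os.path.join(objects_dir, fanout)
        # Skip "pack", "info" and anything that is not a 2-hex-digit dir.
        if len(fanout) != 2 or not set(fanout) <= HEX:
            continue
        if not os.path.isdir(subdir):
            continue
        for name in os.listdir(subdir):
            if len(name) == 38 and set(name) <= HEX:
                count += 1
    return count


def maybe_repack(git_dir: str, threshold: int = LOOSE_THRESHOLD,
                 run=subprocess.run) -> bool:
    """Pack the loose objects if there are more than `threshold` of them.

    `run` is injectable (hypothetical convenience) so the decision logic
    can be exercised without invoking git.
    """
    if count_loose_objects(os.path.join(git_dir, "objects")) <= threshold:
        return False
    # Without -a, git repack only packs objects not yet in a pack;
    # -d then removes the loose copies that have become redundant.
    run(["git", "--git-dir=" + git_dir, "repack", "-d", "-q"], check=True)
    return True
```

A post-commit hook could call maybe_repack(".git") and ignore the
result. Only the first generation step is shown; merging the resulting
small packs into later generations is a separate decision.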
Note that with 3 pack generations, only the first two levels will ever
be repacked - you'll end up with an unlimited number of third
generation packs, which will also end up in LRP* order.
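
The promotion behaviour can be modelled in a few lines. This is only a
sketch of the idea, not the logic of the patch: the add_pack function,
the `fanout` of 2 packs per generation and the pack names are all
hypothetical.

```python
from typing import List


def add_pack(generations: List[List[str]], name: str, fanout: int,
             level: int = 0) -> None:
    """Add a pack to `level`.

    When a generation other than the last one reaches `fanout` packs,
    those packs are merged into a single pack of the next generation.
    The last generation is never merged, so it grows without bound.
    """
    generations[level].append(name)
    is_last = level == len(generations) - 1
    if not is_last and len(generations[level]) >= fanout:
        merged = "+".join(generations[level])
        generations[level].clear()
        add_pack(generations, merged, fanout, level + 1)


gens: List[List[str]] = [[], [], []]  # three generations
for n in range(1, 9):
    add_pack(gens, "p%d" % n, fanout=2)
print(gens)
```

After eight first generation packs, the first two generations are empty
again and the third holds two packs, each a merge of four originals.
Every further eight packs add one more third generation pack.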
> Already git-gc calls git-repack -a -d. How do you plan to change this?
> I wonder if you should be making git-gc more intelligent instead.
>
> Also, you introduce a new pack properties file (.gen) which seems
> awkward to me.
This implementation is a simple demonstration of the logic, designed to
communicate the idea and stimulate discussion. I think the logic could
probably go elsewhere too, and yes, the new file is a bit of a hack.
It might be better to base the "generation" of a pack on its actual
size, for instance - ie, instead of going by the number of loose
objects, go by their size, and call a pack 1st generation if it is <1MB,
2nd generation if it is <5MB, etc. When the combined size of the 1st
generation packs gets above 5MB, that generation is full and a new 2nd
generation pack is made. Then no state file is required.
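
A sketch of that size-based classification follows, using the 1MB and
5MB figures above. The function names are made up for the example, and
the thresholds for later generations are left open ("etc.") in the
message, so only the first generation's "full" test is shown.

```python
from typing import List

MB = 1024 * 1024
# Upper size bound per generation: 1st < 1 MB, 2nd < 5 MB.
# Anything larger counts as the last generation.
GENERATION_LIMITS = [1 * MB, 5 * MB]
FIRST_GENERATION_FULL = 5 * MB  # combined size that triggers a merge


def generation_of(pack_size: int) -> int:
    """Return the 0-based generation implied by a pack size in bytes."""
    for gen, limit in enumerate(GENERATION_LIMITS):
        if pack_size < limit:
            return gen
    return len(GENERATION_LIMITS)


def first_generation_full(pack_sizes: List[int]) -> bool:
    """True when the 1st generation packs together exceed 5 MB."""
    first = [s for s in pack_sizes if generation_of(s) == 0]
    return sum(first) > FIRST_GENERATION_FULL


sizes = [900_000] * 6 + [3 * MB]  # six small packs and one 3 MB pack
print(generation_of(3 * MB), first_generation_full(sizes))
```

The sizes could come from os.path.getsize() on the .pack files, which
is what makes a separate state file unnecessary. Note that a merged
pack of more than 5MB would itself exceed the "<5MB" bound, so the
exact thresholds would need tuning.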
> Perhaps something like this would be useful on a huge repository
> under active use. But delta re-use makes full repacking quite quick for
> a reasonably-sized repository already, and I don't see this being very useful
> for a repository which is large due to large objects.
I agree with your point of view; however, I think that if the feature is
out there but disabled by default, then this can be found out through
experience. As you can see, all of the elements to implement it are
already there - and as you mention, combining packs is already quick.
Sam.
* Last Recently Packed ;)