From: mkoegler@auto.tuwien.ac.at (Martin Koegler)
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: RFC: [PATCH] Support incremental pack files
Date: Mon, 26 Feb 2007 22:45:52 +0100
Message-ID: <20070226214552.GA13402@auto.tuwien.ac.at>
In-Reply-To: <7vfy8x9tvo.fsf@assigned-by-dhcp.cox.net>

On Fri, Feb 23, 2007 at 12:10:35AM -0800, Junio C Hamano wrote:
> mkoegler@auto.tuwien.ac.at (Martin Koegler) writes:
>
> > Committing a new version in GIT increases the storage by the compressed
> > size of each changed blob. Packing all unpacked objects decreases the
> > required storage, but does not generate deltas against objects in
> > packs. You need to repack all objects to get around this.
> >
> > For normal source code, this is not a problem. But if you want to use
> > git for big files, you waste storage (or CPU time for everything
> > repacking).
>
> Three points that might help you without any code change.
>
> - Have you run "git repack -a -d" without "-f"? Reusing of
> existing delta is specifically designed to avoid the "CPU
> time for everything repacking" problem.
>
> - If you are dealing with something other than "normal source
> code", do you know if your objects delta against each other
> well? If not, turning core.legacyheaders off might be a
> win. It allows the objects that are recorded as non-delta in
> resulting pack to be copied straight from loose objects.
I currently use CVS to store the daily changes in database dumps (files
consisting mostly of INSERT INTO xx (...) VALUES (...); statements). I'm
trying to switch this to git.

A commit typically consists of several files that are larger than 100 MB
and grow every day. The unpacked blob objects of one commit currently
require about 60 MB. An incremental pack file containing one commit is
smaller than 1 MB, so delta compression works well.
> - Once you accumulated large enough packs with existing
> objects, marking them with .keep would leave them untouched
> during subsequent repack. When "git repack -a -d" repacks
> "everything", its definition of "everything" becomes "except
> things that are in packs marked with .keep files".
>
> Side note: Is the .keep mechanism sufficiently documented? I am
> too lazy to check that right now, but here is a tip. After
> releasing the big one, like v1.5.0, I do:
I have not found any mention of this in the git documentation.
> $ P=.git/objects/pack
> $ git rev-list --objects v1.5.0 |
> git pack-objects --delta-base-offset \
> --depth=30 --window=100 --no-reuse-delta pack
> ...
> 6fba5cb8ed92dfef71ff47def9f95fa1e703ba59
> $ mv pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.* $P/
> $ echo 'Post 1.5.0' >$P/pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.keep
> $ git gc --prune
>
> This does three things:
>
> - It packs everything reachable from v1.5.0 with delta chain
> that is deeper than the default.
>
> - The pack is installed in the object store; the presence of
> .keep file (the contents of it does not matter) tells
> subsequent repack not to touch it.
>
> - Then the remaining objects are packed into different pack.
>
> With this, the repository uses two packs, one is what I'll keep
> until it's time to do the big repack again, another is what's
> constantly recreated by repacking but contains only "recent"
> object.
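
To see which packs a later "git repack -a -d" would leave alone under this
scheme, a small loop along the following lines could be used. This is only a
sketch: show_keep_status is a hypothetical helper name, and it assumes the
usual pack-*.pack / pack-*.keep naming inside the pack directory.

```shell
#!/bin/sh
# Hypothetical helper: report for each pack whether a .keep marker exists.
show_keep_status () {
    packdir=$1
    for pack in "$packdir"/pack-*.pack
    do
        [ -f "$pack" ] || continue          # glob matched nothing
        name=$(basename "$pack" .pack)
        if [ -e "$packdir/$name.keep" ]
        then
            echo "$name kept"
        else
            echo "$name repackable"
        fi
    done
}

# Example:
#   show_keep_status .git/objects/pack
```
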
This could be a practical solution for me. Its biggest disadvantage is
that each pack file is at least 60 MB.

A nice feature of git is that it normally does not change existing files,
which keeps incremental backups small. I want to retain this, so I want
to avoid unnecessary repacking.

As I have no tags, I can base the repacking decision only on file size:

* Daily: mark all packs >= e.g. 100 MB as kept and repack the
  repository.
* Weekly/Monthly/Yearly: repack the repository, including packs of the
  next size class.

My first idea was to write a script which deletes all keep files,
recreates them for packs bigger than a specified size, and then starts
git-repack.
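
A minimal sketch of such a script might look like the following. The
function name mark_large_packs and its two arguments (pack directory and
size limit in KiB) are assumptions made for illustration; sizes are read
with wc -c so that no non-portable stat options are needed.

```shell
#!/bin/sh
# Hypothetical helper: refresh the .keep markers so that only packs of at
# least LIMIT_KB KiB are protected from the next "git repack -a -d".
mark_large_packs () {
    packdir=$1
    limit_kb=$2

    # Drop all existing markers first. Note that this also removes .keep
    # files that were created by hand or by other tools.
    rm -f "$packdir"/pack-*.keep

    for pack in "$packdir"/pack-*.pack
    do
        [ -f "$pack" ] || continue              # glob matched nothing
        size_bytes=$(($(wc -c <"$pack")))       # arithmetic strips padding
        if [ "$size_bytes" -ge $((limit_kb * 1024)) ]
        then
            # The content of the .keep file does not matter.
            echo "kept: >= ${limit_kb}k" >"${pack%.pack}.keep"
        fi
    done
}

# Example (daily run with a 100 MB threshold):
#   mark_large_packs .git/objects/pack 102400 && git repack -a -d
```

For the weekly or monthly runs, the same function could be called with a
larger limit, so that the packs of the next size class lose their marker
and are repacked.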
As git-repack already calls find, this could easily be added to the
script:
--- git-repack 2007-02-17 18:06:09.000000000 +0100
+++ git-repack1 2007-02-26 22:09:12.000000000 +0100
@@ -8,11 +8,12 @@
. git-sh-setup
no_update_info= all_into_one= remove_redundant=
-local= quiet= no_reuse_delta= extra=
+local= quiet= no_reuse_delta= extra= sizearg=
while case "$#" in 0) break ;; esac
do
case "$1" in
-n) no_update_info=t ;;
+ -s) sizearg="-size -${2}k" ; shift; ;;
-a) all_into_one=t ;;
-d) remove_redundant=t ;;
-q) quiet=-q ;;
@@ -46,7 +47,7 @@
;;
,t,)
if [ -d "$PACKDIR" ]; then
- for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
+ for e in `cd "$PACKDIR" && find . -type f $sizearg -name '*.pack' \
| sed -e 's/^\.\///' -e 's/\.pack$//'`
do
if [ -e "$PACKDIR/$e.keep" ]; then
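
The effect of the added $sizearg can be tried outside git-repack with the
same find/sed pipeline. The sketch below is an illustration only:
list_small_packs is a hypothetical name, and the "k" size suffix is an
extension found in GNU and BSD find rather than part of POSIX. With GNU
find, sizes are rounded up to whole units, so "-size -100k" matches files
of at most 99 KiB.

```shell
#!/bin/sh
# Mirrors the find/sed pipeline from the patched loop: print the base names
# of all packs smaller than LIMIT_KB. Packs that are filtered out here would
# never be considered by the loop, with or without a .keep file.
list_small_packs () {
    packdir=$1
    limit_kb=$2
    (cd "$packdir" &&
     find . -type f -size "-${limit_kb}k" -name '*.pack' |
     sed -e 's/^\.\///' -e 's/\.pack$//')
}

# Example: list the packs below 100 MB that a patched
# "git-repack -a -d -s 102400" would consider.
#   list_small_packs .git/objects/pack 102400
```
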
> > It only permits, that the base commit of a delta is located in a
> > different pack or as unpacked object.
>
> This "only" change needs to be done _very_ carefully, since
> self-containedness of pack files is one of the important
> elements of the stability of a git repository.
I understand the problems. GIT would need at least a list of external
base objects in the pack to speed up operations such as git-prune.

Regards,
Martin Kögler