From: mkoegler@auto.tuwien.ac.at (Martin Koegler)
To: Dana How <danahow@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <junkio@cox.net>
Subject: Re: [PATCH] git-pack-objects: cache small deltas between big objects
Date: Mon, 21 May 2007 19:59:50 +0200
Message-ID: <20070521175950.GA13818@auto.tuwien.ac.at>
In-Reply-To: <56b7f5510705202135s8c9cd9qf4489b2b5bb2e264@mail.gmail.com>
On Sun, May 20, 2007 at 09:35:56PM -0700, Dana How wrote:
> On 5/20/07, Martin Koegler <mkoegler@auto.tuwien.ac.at> wrote:
> >Creating deltas between big blobs is a CPU and memory intensive task.
> >In the writing phase, all (not reused) deltas are redone.
>
> Actually, just the ones selected, which is approx 1/window.
> Do you have any numbers describing the effects on runtime
> and memory size for a known repo like linux-2.6?
Objects below 1 MB are not considered for caching.
The linux kernel has only such objects:
linux.git$ find -size +1000k |grep -v ".git"|wc
0 0 0
So no caching happens. The required memory is only increased by the
new pointer in object_entry.
At runtime, we have an additional (#objects)*(window size+1) null pointer
checks, (#objects)*(window size) pointer initializations with zero, and
(#objects)*(window size) evaluations of the caching policy check:
((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
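The policy check quoted above can be sketched as a standalone C predicate. This is only an illustration of the expression as written in this mail: the function name delta_is_cacheable and the use of uint64_t are assumptions made for the example, not the identifiers used in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper (name and types are assumptions, not taken from
 * the patch): returns 1 if a delta of delta_size bytes between a source
 * of src_size bytes and a target of trg_size bytes passes the caching
 * policy quoted in the mail, 0 otherwise.
 *
 * src_size >> 20 is the source size in whole MiB, trg_size >> 21 is the
 * target size in whole 2 MiB units, and delta_size >> 10 is the delta
 * size in whole KiB.
 */
static int delta_is_cacheable(uint64_t src_size, uint64_t trg_size,
                              uint64_t delta_size)
{
	return ((src_size >> 20) + (trg_size >> 21)) > (delta_size >> 10);
}
```

With a 4 MiB source and a 4 MiB target the left-hand side is 4 + 2 = 6, so a 5 KiB delta passes the check and a 6 KiB delta does not. For a source below 1 MiB and a target below 2 MiB the left-hand side is 0 and nothing is cached, which matches the kernel tree result above.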
Writing a cached delta is faster, as we avoid creating a delta. Some
calls to free are delayed.
> >This patch adds support for caching deltas from the deltifying phase, so
> >that the writing phase is faster.
> >
> >The caching is limited to small deltas to avoid increasing memory usage
> >very much.
> >The implemented limit is (memory needed to create the delta)/1024.
>
> Your limit is applied per-object, and there is no overall limit
> on the amount of memory not freed in the delta phase.
> I suspect this caching would be disastrous for the large repo
> with "megablobs" I'm trying to wrestle with at the moment.
http://www.spinics.net/lists/git/msg31241.html:
> At the moment I'm experimenting on a git repository with
> a 4.5GB checkout, and 18 months of history in 4K commits
> comprising 100GB (uncompressed) of blobs stored in
> 7 packfiles of 2GB or less. Hopefully I'll be able to say
> more about tweaking packing shortly.
If you have 100 GB of uncompressed data in your pack files, the cache
limit is between 100MB and 200MB with the current policy.
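One possible way to arrive at a figure in that range is to turn the policy expression into a size bound. The sketch below is an assumption about the arithmetic, not a derivation given in this mail: the function name cache_limit_bytes is hypothetical, and applying it to the 100 GB total treats all blobs as one aggregate source/target pair, which ignores per-object rounding.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): exclusive upper bound, in
 * bytes, on the size of a delta that still satisfies
 *   (src_size >> 20) + (trg_size >> 21) > (delta_size >> 10)
 * A result of 0 means no delta is cached for this pair.
 */
static uint64_t cache_limit_bytes(uint64_t src_size, uint64_t trg_size)
{
	return ((src_size >> 20) + (trg_size >> 21)) << 10;
}
```

Under those assumptions, cache_limit_bytes(100ULL << 30, 100ULL << 30) gives 100 MiB from the source term plus 50 MiB from the target term, 150 MiB in total, which falls inside the 100MB to 200MB range stated above.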
The aim of my patch is to speed up pack writing without increasing
memory usage very much, if you have blobs of some hundred MB size in
your repository.
The caching policy could be extended to spend more memory on caching
other deltas. Ideas on this topic are welcome.
Best regards, Martin Kögler
PS: If you are trying to optimize packing speed/size, you could test
the following patch: http://marc.info/?l=git&m=117908942525171&w=2